Loan Default Prediction¶

Problem Definition¶

The Context:¶

The bank is concerned about loan defaults, as they significantly impact profits. The current loan approval process is manual, effort-intensive, and prone to human error and biases. Although there have been attempts to automate this using heuristics, the focus now is on using data science and machine learning to make the process more efficient and unbiased. It's crucial that any automated system doesn't inherit the human biases. The bank wants to establish a credit scoring model compliant with the Equal Credit Opportunity Act's guidelines, using data from recent approved applicants. This model should be predictive but also interpretable to justify any loan rejections. The objective is to build a classification model to predict potential defaulters and recommend important loan approval features to the bank.

  • Why is this problem important to solve?

Loan defaults directly cut into the bank's profits, and the current manual approval process is effort-intensive and prone to human error and bias. Solving this problem lets the bank reduce losses from defaults, speed up approvals, and keep lending decisions fair and compliant with the Equal Credit Opportunity Act.

The objective:¶

  • What is the intended goal?

The primary goal is to build a classification model that predicts which clients are likely to default on their loans. Additionally, the model should provide recommendations to the bank about the important features to consider during the loan approval process.

The key questions:¶

  • How can the bank predict potential loan defaulters with higher accuracy?
  • How can the bank automate and optimize the loan approval process without introducing or perpetuating biases?
  • What are the most important features or factors that the bank should consider when approving loans to minimize defaults?
  • How can the bank ensure the model's decisions are interpretable and justifiable, especially in cases of loan rejections?

The problem formulation:¶

  • What is it that we are trying to solve using data science?
  • Predicting which clients are more likely to default on their loans.
  • Identifying the most influential features that impact loan defaults.
  • Building an interpretable model that can not only predict but also provide justifications for its predictions, ensuring compliance with regulations like the Equal Credit Opportunity Act.

Data Description:¶

The Home Equity dataset (HMEQ) contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable that indicates whether an applicant has ultimately defaulted or has been severely delinquent. This adverse outcome occurred in 1,189 cases (20 percent). 12 input variables were registered for each applicant.

  • BAD: 1 = Client defaulted on loan, 0 = loan repaid

  • LOAN: Amount of loan approved.

  • MORTDUE: Amount due on the existing mortgage.

  • VALUE: Current value of the property.

  • REASON: Reason for the loan request (HomeImp = home improvement; DebtCon = debt consolidation, i.e., taking out a new loan to pay off other liabilities and consumer debts).

  • JOB: The type of job the loan applicant has, such as manager, self-employed, etc.

  • YOJ: Years at present job.

  • DEROG: Number of major derogatory reports (which indicates a serious delinquency or late payments).

  • DELINQ: Number of delinquent credit lines (a line of credit becomes delinquent when a borrower does not make the minimum required payments 30 to 60 days past the day on which the payments were due).

  • CLAGE: Age of the oldest credit line in months.

  • NINQ: Number of recent credit inquiries.

  • CLNO: Number of existing credit lines.

  • DEBTINC: Debt-to-income ratio (all monthly debt payments divided by gross monthly income; one way lenders measure an applicant's ability to manage the monthly payments on the money they plan to borrow).
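To make the DEBTINC definition concrete, here is a small illustrative calculation (the figures are made up for the example, not taken from the dataset):

```python
# Hypothetical illustration of the DEBTINC definition above
# (the figures are made up for the example, not taken from the dataset)
monthly_debt_payments = 1500.0  # e.g. mortgage + car payment + card minimums
gross_monthly_income = 4500.0

debt_to_income = monthly_debt_payments / gross_monthly_income * 100
print(round(debt_to_income, 2))  # → 33.33
```

A ratio in the low thirties, like this one, sits near the middle of the DEBTINC values we will see in the summary statistics below.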

Import the necessary libraries and Data¶

In [1]:
import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

# To scale the data using z-score
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split

# Algorithms to use
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Metrics to evaluate the model
from sklearn import metrics
from sklearn.metrics import (confusion_matrix, classification_report, precision_recall_curve,
                             recall_score, precision_score, accuracy_score)

# For tuning the model
from sklearn.model_selection import GridSearchCV

# Setting the random seed to 1 for reproducibility of results
import random
random.seed(1)
np.random.seed(1)

# To ignore warnings
import warnings
warnings.filterwarnings("ignore")

Data Overview¶

  • Reading the dataset
  • Understanding the shape of the dataset
  • Checking the data types
  • Checking for missing values
  • Checking for duplicated values
In [2]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [3]:
# Loading the dataset
data= pd.read_csv('/content/drive/MyDrive/Data science MIT/Capstone project/hmeq.csv')
In [4]:
#check the first 5 rows in the dataset
data.head()
Out[4]:
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
0 1 1100 25860.0 39025.0 HomeImp Other 10.5 0.0 0.0 94.366667 1.0 9.0 NaN
1 1 1300 70053.0 68400.0 HomeImp Other 7.0 0.0 2.0 121.833333 0.0 14.0 NaN
2 1 1500 13500.0 16700.0 HomeImp Other 4.0 0.0 0.0 149.466667 1.0 10.0 NaN
3 1 1500 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 0 1700 97800.0 112000.0 HomeImp Office 3.0 0.0 0.0 93.333333 0.0 14.0 NaN
In [5]:
#check the last 5 rows in the dataset
data.tail()
Out[5]:
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
5955 0 88900 57264.0 90185.0 DebtCon Other 16.0 0.0 0.0 221.808718 0.0 16.0 36.112347
5956 0 89000 54576.0 92937.0 DebtCon Other 16.0 0.0 0.0 208.692070 0.0 15.0 35.859971
5957 0 89200 54045.0 92924.0 DebtCon Other 15.0 0.0 0.0 212.279697 0.0 15.0 35.556590
5958 0 89800 50370.0 91861.0 DebtCon Other 14.0 0.0 0.0 213.892709 0.0 16.0 34.340882
5959 0 89900 48811.0 88934.0 DebtCon Other 15.0 0.0 0.0 219.601002 0.0 16.0 34.571519
In [6]:
#check the number of rows and columns in the dataset
data.shape
Out[6]:
(5960, 13)
In [7]:
data.dtypes
Out[7]:
BAD          int64
LOAN         int64
MORTDUE    float64
VALUE      float64
REASON      object
JOB         object
YOJ        float64
DEROG      float64
DELINQ     float64
CLAGE      float64
NINQ       float64
CLNO       float64
DEBTINC    float64
dtype: object
In [8]:
# Let us see the info of the data
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   BAD      5960 non-null   int64  
 1   LOAN     5960 non-null   int64  
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   object 
 5   JOB      5681 non-null   object 
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: float64(9), int64(2), object(2)
memory usage: 605.4+ KB

Observation

  • Apart from the target variable BAD, which is of int64 dtype but is categorical, the features comprise 10 numerical and 2 string (object) columns.
  • There are 5960 rows and 13 columns in the original dataset.
  • The columns YOJ, DEROG, DELINQ, NINQ and CLNO are stored as float64 but represent counts, so they could be converted to an integer dtype; LOAN, a monetary amount, is stored as int64 and could be converted to float64.
  • Not all columns have the same non-null count, so there are missing values in the data
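A sketch of the dtype conversions suggested above, on a tiny made-up frame. Since the count-like columns contain NaNs and plain int64 cannot hold missing values, pandas' nullable Int64 dtype is used:

```python
import pandas as pd

# Toy frame mimicking two HMEQ columns (values are illustrative)
df = pd.DataFrame({'LOAN': [1100, 1300], 'DEROG': [0.0, None]})

df['LOAN'] = df['LOAN'].astype('float64')  # LOAN is a monetary amount
df['DEROG'] = df['DEROG'].astype('Int64')  # a count, but with missing values
print(df.dtypes)
```

The same `astype` calls would apply to the real `data` frame once the missing values are dealt with (or via the nullable dtype, as here).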

Let's check the unique values in each column

In [9]:
# Checking the count of unique values in each column; nunique can tell us which columns are categorical or numerical
data.nunique()
Out[9]:
BAD           2
LOAN        540
MORTDUE    5053
VALUE      5381
REASON        2
JOB           6
YOJ          99
DEROG        11
DELINQ       14
CLAGE      5314
NINQ         16
CLNO         62
DEBTINC    4693
dtype: int64

Observations:

  • We have 3 categorical columns (BAD, REASON, JOB) and 10 continuous columns.

Duplicated and Missing Values¶

In [10]:
# Let's check for duplicated rows
data.duplicated().sum()
Out[10]:
0

There are no duplicated rows in the data

In [11]:
#check for null values
data.isnull().sum()
Out[11]:
BAD           0
LOAN          0
MORTDUE     518
VALUE       112
REASON      252
JOB         279
YOJ         515
DEROG       708
DELINQ      580
CLAGE       308
NINQ        510
CLNO        222
DEBTINC    1267
dtype: int64
In [12]:
#percentage of missing values in the dataframe
round(data.isnull().sum() / data.shape[0]*100,2).sort_values(ascending=True)
Out[12]:
BAD         0.00
LOAN        0.00
VALUE       1.88
CLNO        3.72
REASON      4.23
JOB         4.68
CLAGE       5.17
NINQ        8.56
YOJ         8.64
MORTDUE     8.69
DELINQ      9.73
DEROG      11.88
DEBTINC    21.26
dtype: float64

Observation

  • No missing values in the columns BAD and LOAN
  • All other columns have missing values, with DEBTINC and DEROG having the most, ~21.26% and ~11.88% respectively. The missing values have to be treated to get a good model for our analysis.
  • No column is missing more than 50% of its values, so rather than dropping columns we will treat the missing values.
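One common way to treat these gaps, as proposed above, is median imputation for numeric columns and mode imputation for categorical ones; a minimal sketch on toy data with HMEQ-style column names:

```python
import pandas as pd

# Toy data with HMEQ-style columns (illustrative values)
toy = pd.DataFrame({'DEBTINC': [30.0, None, 40.0],
                    'REASON': ['DebtCon', None, 'HomeImp']})

# Median for numeric columns, mode for categorical columns
toy['DEBTINC'] = toy['DEBTINC'].fillna(toy['DEBTINC'].median())
toy['REASON'] = toy['REASON'].fillna(toy['REASON'].mode()[0])
print(toy.isnull().sum().sum())  # → 0
```

Median and mode are robust defaults here because most of the numeric columns are right-skewed with outliers, which would distort mean imputation.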

Class Distribution¶

In [13]:
### Percentage  distribution of the target variable "BAD"
data['BAD'].value_counts(1)*100
Out[13]:
0    80.050336
1    19.949664
Name: BAD, dtype: float64

Observation

  • 80% of the clients in this dataset repaid the loan
  • About 20% of the clients defaulted on the loan
  • This can hence be considered a somewhat imbalanced classification problem
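With roughly 80/20 classes, a stratified train-test split preserves this ratio in both sets; a self-contained sketch on synthetic labels (not the project's actual split):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the dataset's ~80/20 class ratio
y = np.array([0] * 80 + [1] * 20)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

print(y_tr.mean(), y_te.mean())  # class ratio preserved in both splits
```

Without `stratify=y`, a random split could leave too few defaulters in the test set to evaluate recall reliably.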

Summary Statistics¶

In [14]:
data.describe().round(2).T
Out[14]:
count mean std min 25% 50% 75% max
BAD 5960.0 0.20 0.40 0.00 0.00 0.00 0.00 1.00
LOAN 5960.0 18607.97 11207.48 1100.00 11100.00 16300.00 23300.00 89900.00
MORTDUE 5442.0 73760.82 44457.61 2063.00 46276.00 65019.00 91488.00 399550.00
VALUE 5848.0 101776.05 57385.78 8000.00 66075.50 89235.50 119824.25 855909.00
YOJ 5445.0 8.92 7.57 0.00 3.00 7.00 13.00 41.00
DEROG 5252.0 0.25 0.85 0.00 0.00 0.00 0.00 10.00
DELINQ 5380.0 0.45 1.13 0.00 0.00 0.00 0.00 15.00
CLAGE 5652.0 179.77 85.81 0.00 115.12 173.47 231.56 1168.23
NINQ 5450.0 1.19 1.73 0.00 0.00 1.00 2.00 17.00
CLNO 5738.0 21.30 10.14 0.00 15.00 20.00 26.00 71.00
DEBTINC 4693.0 33.78 8.60 0.52 29.14 34.82 39.00 203.31

Observations from Summary Statistics

  • The maximum and minimum loan amounts requested are 89,900 and 1,100 respectively
  • The maximum and minimum amounts due on the existing mortgage are 399,550 and 2,063 respectively
  • The maximum and minimum values of the property are 855,909 and 8,000 respectively
  • The maximum number of years at the client's current job is 41
  • The maximum number of major derogatory reports (DEROG) is 10
  • The maximum number of delinquent credit lines (DELINQ) is 15
  • The maximum and minimum age of the oldest credit line in months (CLAGE) are 1,168.23 and 0
  • The maximum number of recent credit inquiries (NINQ) is 17
  • The mean number of existing credit lines (CLNO) is 21.30 and the maximum is 71
  • The mean debt-to-income ratio (DEBTINC) is 33.78; the maximum and minimum are 203.31 and 0.52
In [ ]:
# Check the value distribution (as proportions) of each column
for column in data.columns:
    print(f"Column: {column}")
    print(data[column].value_counts(normalize=True))
    print('-' * 50)
Column: BAD
0    0.800503
1    0.199497
Name: BAD, dtype: float64
--------------------------------------------------
Column: LOAN
15000    0.017617
10000    0.013591
20000    0.012416
25000    0.012248
12000    0.011577
           ...   
47700    0.000168
47100    0.000168
46900    0.000168
46700    0.000168
89900    0.000168
Name: LOAN, Length: 540, dtype: float64
--------------------------------------------------
Column: MORTDUE
42000.0     0.002021
47000.0     0.001838
65000.0     0.001654
50000.0     0.001286
124000.0    0.001286
              ...   
65372.0     0.000184
15346.0     0.000184
58549.0     0.000184
69195.0     0.000184
48811.0     0.000184
Name: MORTDUE, Length: 5053, dtype: float64
--------------------------------------------------
Column: VALUE
60000.0     0.002565
80000.0     0.002394
85000.0     0.002052
65000.0     0.001881
78000.0     0.001710
              ...   
116994.0    0.000171
42682.0     0.000171
72175.0     0.000171
70095.0     0.000171
88934.0     0.000171
Name: VALUE, Length: 5381, dtype: float64
--------------------------------------------------
Column: REASON
DebtCon    0.688157
HomeImp    0.311843
Name: REASON, dtype: float64
--------------------------------------------------
Column: JOB
Other      0.420349
ProfExe    0.224608
Office     0.166872
Mgr        0.135011
Self       0.033973
Sales      0.019187
Name: JOB, dtype: float64
--------------------------------------------------
Column: YOJ
0.00     0.076217
1.00     0.066667
2.00     0.063728
5.00     0.061157
4.00     0.059504
           ...   
29.90    0.000184
12.90    0.000184
13.50    0.000184
0.25     0.000184
8.30     0.000184
Name: YOJ, Length: 99, dtype: float64
--------------------------------------------------
Column: DEROG
0.0     0.861957
1.0     0.082826
2.0     0.030465
3.0     0.011043
4.0     0.004379
5.0     0.002856
6.0     0.002856
7.0     0.001523
8.0     0.001142
9.0     0.000571
10.0    0.000381
Name: DEROG, dtype: float64
--------------------------------------------------
Column: DELINQ
0.0     0.776766
1.0     0.121561
2.0     0.046468
3.0     0.023978
4.0     0.014498
5.0     0.007063
6.0     0.005019
7.0     0.002416
8.0     0.000929
10.0    0.000372
11.0    0.000372
15.0    0.000186
12.0    0.000186
13.0    0.000186
Name: DELINQ, dtype: float64
--------------------------------------------------
Column: CLAGE
102.500000    0.001238
206.966667    0.001238
177.500000    0.001062
123.766667    0.001062
95.366667     0.001062
                ...   
240.856017    0.000177
196.241371    0.000177
71.461705     0.000177
184.880011    0.000177
219.601002    0.000177
Name: CLAGE, Length: 5314, dtype: float64
--------------------------------------------------
Column: NINQ
0.0     0.464404
1.0     0.245688
2.0     0.143119
3.0     0.071927
4.0     0.028624
5.0     0.013761
6.0     0.010275
7.0     0.008073
10.0    0.005138
8.0     0.004037
9.0     0.002018
11.0    0.001835
12.0    0.000367
13.0    0.000367
14.0    0.000183
17.0    0.000183
Name: NINQ, dtype: float64
--------------------------------------------------
Column: CLNO
16.0    0.055071
19.0    0.053503
24.0    0.046009
23.0    0.045138
21.0    0.040955
          ...   
58.0    0.000523
71.0    0.000349
53.0    0.000349
57.0    0.000174
63.0    0.000174
Name: CLNO, Length: 62, dtype: float64
--------------------------------------------------
Column: DEBTINC
37.113614    0.000213
44.382578    0.000213
31.614680    0.000213
41.576701    0.000213
41.395462    0.000213
               ...   
31.613930    0.000213
39.244669    0.000213
40.943866    0.000213
30.444839    0.000213
34.571519    0.000213
Name: DEBTINC, Length: 4693, dtype: float64
--------------------------------------------------

Exploratory Data Analysis (EDA) and Visualization¶

Leading Questions:

  1. What is the range of values for the loan amount variable "LOAN"?
  2. How does the distribution of years at present job "YOJ" vary across the dataset?
  3. How many unique categories are there in the REASON variable?
  4. What is the most common category in the JOB variable?
  5. Is there a relationship between the REASON variable and the proportion of applicants who defaulted on their loan?
  6. Do applicants who default have a significantly different loan amount compared to those who repay their loan?
  7. Is there a correlation between the value of the property and the loan default rate?
  8. Do applicants who default have a significantly different mortgage amount compared to those who repay their loan?

Univariate Analysis¶

In [15]:
# Get list of numerical columns
num_cols = data.select_dtypes(include=['float64', 'int64']).columns.tolist()

print(num_cols)
['BAD', 'LOAN', 'MORTDUE', 'VALUE', 'YOJ', 'DEROG', 'DELINQ', 'CLAGE', 'NINQ', 'CLNO', 'DEBTINC']
In [16]:
# Get list of categorical columns
cat_cols = data.select_dtypes(include=['object']).columns.tolist()

print(cat_cols)
['REASON', 'JOB']

We will first define a hist_box() function that provides both a boxplot and a histogram in the same visual, with which we can perform univariate analysis on the columns of this dataset.

In [17]:
# Defining the hist_box() function
def hist_box(col):
  f, (ax_box, ax_hist) = plt.subplots(2, sharex=True, gridspec_kw={'height_ratios': (0.15, 0.85)}, figsize=(15,10))
  sns.set(style='darkgrid')
  # Adding a graph in each part
  sns.boxplot(x=data[col], ax=ax_box, showmeans=True)
  sns.histplot(data[col], kde=True, ax=ax_hist)  # histplot replaces the deprecated distplot
  ax_hist.axvline(data[col].mean(), color='green', linestyle='--') # Green line corresponds to the mean in the plot
  ax_hist.axvline(data[col].median(), color='orange', linestyle='-') # Orange line corresponds to the median in the plot
  plt.show()
In [18]:
hist_box('LOAN')

Observation

  • The LOAN distribution is skewed to the right, with most loans being of smaller amounts
  • It has many outliers
In [19]:
hist_box('MORTDUE')

Observation

  • The MORTDUE distribution is skewed to the right, indicating most outstanding mortgage amounts are on the lower side.
  • It has many outliers
In [20]:
hist_box('VALUE')

Observation

  • The distribution of the current value of the property is skewed to the right, with most property values being on the lower side.
  • It has many outliers
In [21]:
hist_box('YOJ')

Observation

  • The YOJ distribution is skewed to the right, with many applicants having few years at their current job
  • It has many outliers
In [ ]:
hist_box('DEROG')

Observation

  • The DEROG distribution is skewed to the right.
  • Most values are close to zero, suggesting that a large number of applicants have few or no derogatory reports.
  • It has many outliers and a lot of missing data; it is possible that most clients simply do not have derogatory reports.
In [22]:
hist_box('DELINQ')

Observation

  • As with DEROG, most values are close to zero and the distribution is skewed to the right
  • It has many outliers and missing data; it is possible that most clients do not have delinquent credit lines.
In [23]:
hist_box('DEBTINC')

Observation

  • The DEBTINC distribution is slightly skewed to the right
  • Most values are concentrated in the 30-40 range, but there are some extreme values on the right, suggesting outliers
  • It has many outliers and missing data
In [25]:
hist_box('NINQ')

Observation

  • The NINQ distribution is skewed to the right
  • Most applicants have few recent credit inquiries
  • It has many outliers and missing data; it is possible that a number of the clients are new borrowers taking a loan for the first time
In [26]:
hist_box('CLNO')

Observation

  • The CLNO (number of existing credit lines) distribution is close to normal with a slight skew to the right, which could be due to the presence of outliers
In [24]:
hist_box('CLAGE')

Observation

  • The CLAGE distribution appears roughly normal with a slight skew to the right.
  • The skewness could be due to the presence of outliers
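The repeated "many outliers" observations follow the boxplot's conventional 1.5 × IQR whisker rule (seaborn's default); a small sketch of that rule on toy numbers:

```python
import numpy as np

# The 1.5 * IQR whisker rule on a toy sample (assumed convention,
# matching seaborn's default boxplot whiskers)
x = np.array([1, 2, 3, 4, 5, 100])
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]
print(outliers)  # → [100]
```

Applying the same rule column by column to `data` would count the outliers flagged visually in the boxplots above.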

Univariate Analysis of categorical variables¶

In [27]:
cat_cols
Out[27]:
['REASON', 'JOB']
In [28]:
# Printing the % sub categories of each category.
for i in cat_cols:

    print(data[i].value_counts(normalize = True))

    print('*' * 40)
DebtCon    0.688157
HomeImp    0.311843
Name: REASON, dtype: float64
****************************************
Other      0.420349
ProfExe    0.224608
Office     0.166872
Mgr        0.135011
Self       0.033973
Sales      0.019187
Name: JOB, dtype: float64
****************************************

Observation

  • ~69% of the clients requested a loan for debt consolidation
  • ~42% of the clients who requested a loan fall under the 'Other' job category
  • Clients in the 'Sales' category requested the fewest loans
In [30]:
# Create subplots: 1 row by 2 columns
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 5))

# The ravel function returns a flattened array
ax = axes.ravel()


for i, col in enumerate(cat_cols):
    sns.countplot(x=col, data=data, ax=ax[i], order=data[col].value_counts().index)
    ax[i].set_title(f'Distribution of {col}')
    ax[i].set_xticklabels(ax[i].get_xticklabels(), rotation=45)

# Adjust layout
plt.tight_layout()
plt.show()

Observations:¶

  • The majority of loan requests were for debt consolidation
  • Clients whose jobs fall under the 'Other' category requested more loans than those in the other categories
  • Clients in the 'Sales' job category made the fewest loan requests

Bivariate Analysis¶

  • What is the relationship between the loan amount and the probability of default?
  • How does the applicant's job correlate with the loan default rate?
  • Is there a noticeable trend between credit score and loan default?
  • Are certain occupations or employment sectors more prone to default than others?
  • Do applicants with a longer credit history have a lower default rate?
  • Are there differences in default rates among different loan purposes (e.g., home improvement vs. debt consolidation)?

Target and dependent numerical variables¶

In [31]:
num_cols
Out[31]:
['BAD',
 'LOAN',
 'MORTDUE',
 'VALUE',
 'YOJ',
 'DEROG',
 'DELINQ',
 'CLAGE',
 'NINQ',
 'CLNO',
 'DEBTINC']
In [32]:
cols = data[['LOAN','MORTDUE', 'VALUE','YOJ','DEROG','DELINQ','CLAGE','NINQ','CLNO','DEBTINC']].columns.tolist()
plt.figure(figsize=(15,25))

for i, variable in enumerate(cols):
    plt.subplot(5, 2, i + 1)
    sns.boxplot(x=data["BAD"], y=data[variable], palette="PuBu")
    plt.title(variable)

plt.tight_layout()
plt.show()

Observations:¶

  • From the above graph, clients with higher LOAN, MORTDUE, and YOJ values show only a slight difference between repayers and defaulters, while VALUE shows no significant difference
  • Clients with an older CLAGE (age of the oldest credit line in months) tend to repay loans more often than clients with a lower CLAGE
  • Clients with higher DEROG (number of major derogatory reports) and DELINQ (number of delinquent credit lines) tended to default on loan payments
  • A similar pattern is seen with NINQ (number of recent credit inquiries) and CLNO (number of existing credit lines)
  • Clients with a higher DEBTINC (debt-to-income ratio) tend to default more than clients with a lower DEBTINC
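The class-wise patterns read off the boxplots can also be quantified with a group-wise summary, e.g. `data.groupby('BAD')[num_cols].median()` on the real frame; a self-contained toy version:

```python
import pandas as pd

# Toy data shaped like the HMEQ columns discussed above (illustrative values)
toy = pd.DataFrame({'BAD':     [0, 0, 0, 1, 1],
                    'DELINQ':  [0, 0, 1, 1, 3],
                    'DEBTINC': [30.0, 32.0, 34.0, 40.0, 44.0]})

medians = toy.groupby('BAD')[['DELINQ', 'DEBTINC']].median()
print(medians)  # defaulters (BAD=1) show higher medians on both columns
```

Medians are used rather than means because, as seen above, most of these columns are right-skewed with outliers.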
In [33]:
cat_cols
Out[33]:
['REASON', 'JOB']
In [37]:
cols = data[['REASON', 'JOB']].columns.tolist()

plt.figure(figsize=(7, 5))

for i, variable in enumerate(cols):
    # Crosstab to get the frequency for each category and status
    ct = pd.crosstab(data[variable], data['BAD'])

    # Sort the bars based on the count of 'BAD' value (e.g., 1)
    # You can change the sorting column if needed
    ct = ct.sort_values(by=1, ascending=False)

    # Plotting the stacked bar chart
    ax = plt.subplot(1, 2, i+1)
    ct.plot(kind="bar", stacked=True, ax=ax)
    plt.title(variable)
    plt.ylabel('Count')
    plt.xticks(rotation=90)
    plt.tight_layout()

plt.show()

Observation

  • From the above graph, clients who obtained a loan for debt consolidation (DebtCon) account for the largest number of defaults
  • Clients whose jobs fall under the 'Other' category have a higher chance of defaulting on loan payments
  • Clients in the 'Sales' job category took the fewest loans and are more likely to repay than clients in the other job categories
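Because the stacked bars show raw counts, large categories can look riskier simply because they are large; normalizing the crosstab by row gives the default *rate* per category instead (a sketch on toy data with HMEQ-style labels):

```python
import pandas as pd

# Toy data with HMEQ-style category labels (illustrative values)
toy = pd.DataFrame({'REASON': ['DebtCon'] * 4 + ['HomeImp'] * 2,
                    'BAD': [0, 0, 0, 1, 0, 1]})

rates = pd.crosstab(toy['REASON'], toy['BAD'], normalize='index')
print(rates[1])  # per-category default rate
```

On the real frame, `pd.crosstab(data['REASON'], data['BAD'], normalize='index')` would show whether DebtCon's higher default *count* also means a higher default *rate*.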

Multivariate Analysis¶

In [38]:
#lets create a pivot table to show the summary of the data based on the categorical variables
pivot_table = pd.pivot_table(data, index=['REASON', 'JOB'], columns='BAD')
pivot_table
Out[38]:
CLAGE CLNO DEBTINC DELINQ DEROG LOAN MORTDUE NINQ VALUE YOJ
BAD 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
REASON JOB
DebtCon Mgr 179.203028 146.791449 23.420225 23.534483 35.129543 38.591110 0.234414 1.271930 0.151832 0.956140 21022.149123 19923.275862 84201.617978 80240.646552 1.646081 2.079646 106375.289474 115403.442308 9.382547 8.863158
Office 179.587257 138.171649 23.013258 20.488636 34.196836 39.559065 0.346392 1.558140 0.060606 0.469880 18994.696970 19409.782609 68879.309804 70595.511364 1.051485 1.954545 96780.312500 100250.329390 6.907129 8.965761
Other 171.000227 150.057130 19.395883 21.034161 34.190252 37.604921 0.199647 1.148734 0.227599 0.563107 20922.974036 17758.858859 61802.455794 58077.646624 1.322747 1.981132 86781.949693 84135.985669 9.707401 7.760749
ProfExe 200.220297 174.842451 25.326211 27.396825 32.142007 44.358021 0.247407 1.404762 0.104294 0.701613 19983.426184 21089.147287 93087.435537 103977.927419 0.916910 1.873016 128021.669916 136506.228814 8.619829 9.067442
Sales 227.072356 152.000930 23.015873 28.058824 33.461594 51.797315 0.206349 0.566667 0.018182 0.875000 14976.190476 18188.235294 87275.259259 75957.794118 0.174603 1.939394 121789.619048 92922.303030 6.986792 7.061765
Self 198.224972 147.821700 19.707317 26.875000 33.844055 45.729450 0.121951 1.483871 0.024390 0.586207 36419.512195 20381.250000 130548.902439 76569.266667 1.000000 1.500000 179076.560976 108193.370370 7.219512 5.650000
HomeImp Mgr 188.321009 134.204602 19.830508 20.903846 34.068242 39.192406 0.564815 1.519231 0.070000 0.600000 15305.084746 11855.357143 78330.737593 71761.224490 0.472222 1.480769 115008.510678 86740.430000 9.227966 8.282143
Office 188.823871 190.618420 18.914179 18.300000 33.414808 35.834021 0.169355 1.766667 0.084034 1.275862 16920.522388 11387.878788 65897.657895 47071.630000 0.421260 0.740741 92287.193433 71105.586207 11.735537 7.054545
Other 202.190793 129.498015 19.471058 14.222798 33.337985 37.020812 0.285403 0.794737 0.145652 0.656085 14072.994129 10938.048780 56660.921951 45106.493506 0.943277 1.361702 78509.387613 80022.457068 10.393607 6.926111
ProfExe 209.613605 173.708747 22.602410 20.985294 31.838539 31.946864 0.256024 1.179104 0.009259 0.735294 17031.325301 14605.479452 99247.081325 83517.714286 0.651235 1.484375 140341.316265 122466.220588 8.910542 9.217361
Sales 159.964610 133.731617 22.500000 32.000000 41.252041 25.650741 0.000000 0.000000 1.625000 0.500000 5800.000000 4325.000000 69501.000000 97638.333333 0.500000 0.500000 68965.250000 105413.000000 3.750000 8.250000
Self 177.498642 172.032554 24.000000 19.222222 32.338956 61.376969 0.000000 2.444444 0.159574 0.333333 27380.851064 28852.380952 92906.845238 104220.950000 1.159420 2.611111 149464.521277 147892.380952 7.227273 7.285714
In [39]:
#lets create a pivot table to show the summary of the data
pivot_table = pd.pivot_table(data, 'LOAN', index=['REASON', 'JOB'], columns='BAD').round(2)
pivot_table
Out[39]:
BAD 0 1
REASON JOB
DebtCon Mgr 21022.15 19923.28
Office 18994.70 19409.78
Other 20922.97 17758.86
ProfExe 19983.43 21089.15
Sales 14976.19 18188.24
Self 36419.51 20381.25
HomeImp Mgr 15305.08 11855.36
Office 16920.52 11387.88
Other 14072.99 10938.05
ProfExe 17031.33 14605.48
Sales 5800.00 4325.00
Self 27380.85 28852.38
In [ ]:
import seaborn as sns
import matplotlib.pyplot as plt

# Plotting the heatmap
plt.figure(figsize=(12, 8))  # You can adjust the size as needed
sns.heatmap(pivot_table, annot=True, cmap='YlGnBu', fmt='g', cbar_kws={'label': 'Average Loan'})
plt.title('Loan Default by Reason and Job')
plt.show()

Observation

  • The average loan amount, broken down by reason and job category, shows no clear difference between defaulters and non-defaulters
In [40]:
#lets create a pivot table to show the summary of the data
pivot_table = pd.pivot_table(data, 'MORTDUE', index=['REASON', 'JOB'], columns='BAD').round(2)
pivot_table
Out[40]:
BAD 0 1
REASON JOB
DebtCon Mgr 84201.62 80240.65
Office 68879.31 70595.51
Other 61802.46 58077.65
ProfExe 93087.44 103977.93
Sales 87275.26 75957.79
Self 130548.90 76569.27
HomeImp Mgr 78330.74 71761.22
Office 65897.66 47071.63
Other 56660.92 45106.49
ProfExe 99247.08 83517.71
Sales 69501.00 97638.33
Self 92906.85 104220.95
In [41]:
import seaborn as sns
import matplotlib.pyplot as plt

# Plotting the heatmap
plt.figure(figsize=(12, 8))  # You can adjust the size as needed
sns.heatmap(pivot_table, annot=True, cmap='YlGnBu', fmt='g', cbar_kws={'label': 'Average MORTDUE'})
plt.title('Loan Default by Reason and Job')
plt.show()

Observation

  • The average MORTDUE, broken down by reason for the loan and job category, shows no clear difference between defaulters and non-defaulters
In [42]:
#lets create a pivot table to show the summary of the data
pivot_table = pd.pivot_table(data, 'VALUE', index=['REASON', 'JOB'], columns='BAD').round(2)
pivot_table
Out[42]:
BAD 0 1
REASON JOB
DebtCon Mgr 106375.29 115403.44
Office 96780.31 100250.33
Other 86781.95 84135.99
ProfExe 128021.67 136506.23
Sales 121789.62 92922.30
Self 179076.56 108193.37
HomeImp Mgr 115008.51 86740.43
Office 92287.19 71105.59
Other 78509.39 80022.46
ProfExe 140341.32 122466.22
Sales 68965.25 105413.00
Self 149464.52 147892.38
In [43]:
import seaborn as sns
import matplotlib.pyplot as plt

# Plotting the heatmap
plt.figure(figsize=(12, 8))  # You can adjust the size as needed
sns.heatmap(pivot_table, annot=True, cmap='YlGnBu', fmt='g', cbar_kws={'label': 'Average VALUE'})
plt.title('Loan Default by Reason and Job')
plt.show()

Observation

  • The average VALUE of the property, broken down by reason for the loan and job category, shows no clear difference between defaulters and non-defaulters
In [44]:
#lets create a pivot table to show the summary of the data
pivot_table = pd.pivot_table(data, 'YOJ', index=['REASON', 'JOB'], columns='BAD').round(2)
pivot_table
Out[44]:
BAD 0 1
REASON JOB
DebtCon Mgr 9.38 8.86
Office 6.91 8.97
Other 9.71 7.76
ProfExe 8.62 9.07
Sales 6.99 7.06
Self 7.22 5.65
HomeImp Mgr 9.23 8.28
Office 11.74 7.05
Other 10.39 6.93
ProfExe 8.91 9.22
Sales 3.75 8.25
Self 7.23 7.29
In [45]:
import seaborn as sns
import matplotlib.pyplot as plt

# Plotting the heatmap
plt.figure(figsize=(12, 8))  # You can adjust the size as needed
sns.heatmap(pivot_table, annot=True, cmap='YlGnBu', fmt='g', cbar_kws={'label': 'Average YOJ'})
plt.title('Loan Default by Reason and Job')
plt.show()

Observation

  • The average YOJ shows no clear difference between defaulters and non-defaulters
In [46]:
#lets create a pivot table to show the summary of the data
pivot_table = pd.pivot_table(data, 'DEROG', index=['REASON', 'JOB'], columns='BAD').round(2)
pivot_table
Out[46]:
BAD 0 1
REASON JOB
DebtCon Mgr 0.15 0.96
Office 0.06 0.47
Other 0.23 0.56
ProfExe 0.10 0.70
Sales 0.02 0.88
Self 0.02 0.59
HomeImp Mgr 0.07 0.60
Office 0.08 1.28
Other 0.15 0.66
ProfExe 0.01 0.74
Sales 1.62 0.50
Self 0.16 0.33
In [ ]:
import seaborn as sns
import matplotlib.pyplot as plt

# Plotting the heatmap
plt.figure(figsize=(12, 8))  # You can adjust the size as needed
sns.heatmap(pivot_table, annot=True, cmap='YlGnBu', fmt='g', cbar_kws={'label': 'Average DEROG'})
plt.title('Loan Default by Reason and Job')
plt.show()

Observation

  • The average DEROG (number of major derogatory reports) differs markedly between the two classes across job and reason categories: clients with a higher DEROG are more likely to default
In [47]:
#lets create a pivot table to show the summary of the data
pivot_table = pd.pivot_table(data, 'DELINQ', index=['REASON', 'JOB'], columns='BAD').round(2)
pivot_table
Out[47]:
BAD 0 1
REASON JOB
DebtCon Mgr 0.23 1.27
Office 0.35 1.56
Other 0.20 1.15
ProfExe 0.25 1.40
Sales 0.21 0.57
Self 0.12 1.48
HomeImp Mgr 0.56 1.52
Office 0.17 1.77
Other 0.29 0.79
ProfExe 0.26 1.18
Sales 0.00 0.00
Self 0.00 2.44
In [48]:
import seaborn as sns
import matplotlib.pyplot as plt

# Plotting the heatmap
plt.figure(figsize=(12, 8))  # You can adjust the size as needed
sns.heatmap(pivot_table, annot=True, cmap='YlGnBu', fmt='g', cbar_kws={'label': 'Average DELINQ'})
plt.title('Loan Default by Reason and Job')
plt.show()

Observation

  • The average DELINQ (number of delinquent credit lines) differs markedly between the two classes across job and reason categories: clients with a higher DELINQ are more likely to default
In [49]:
#lets create a pivot table to show the summary of the data
pivot_table = pd.pivot_table(data, 'CLAGE', index=['REASON', 'JOB'], columns='BAD').round(2)
pivot_table
Out[49]:
BAD 0 1
REASON JOB
DebtCon Mgr 179.20 146.79
Office 179.59 138.17
Other 171.00 150.06
ProfExe 200.22 174.84
Sales 227.07 152.00
Self 198.22 147.82
HomeImp Mgr 188.32 134.20
Office 188.82 190.62
Other 202.19 129.50
ProfExe 209.61 173.71
Sales 159.96 133.73
Self 177.50 172.03
In [50]:
import seaborn as sns
import matplotlib.pyplot as plt

# Plotting the heatmap
plt.figure(figsize=(12, 8))  # You can adjust the size as needed
sns.heatmap(pivot_table, annot=True, cmap='YlGnBu', fmt='g', cbar_kws={'label': 'Average CLAGE'})
plt.title('Loan Default by Reason and Job')
plt.show()

Observation

  • The average CLAGE (age of the oldest credit line in months) differs markedly between the two classes: clients with a shorter credit history (lower CLAGE) are more likely to default
In [51]:
#lets create a pivot table to show the summary of the data
pivot_table = pd.pivot_table(data, 'NINQ', index=['REASON', 'JOB'], columns='BAD').round(2)
pivot_table
Out[51]:
BAD 0 1
REASON JOB
DebtCon Mgr 1.65 2.08
Office 1.05 1.95
Other 1.32 1.98
ProfExe 0.92 1.87
Sales 0.17 1.94
Self 1.00 1.50
HomeImp Mgr 0.47 1.48
Office 0.42 0.74
Other 0.94 1.36
ProfExe 0.65 1.48
Sales 0.50 0.50
Self 1.16 2.61
In [52]:
import seaborn as sns
import matplotlib.pyplot as plt

# Plotting the heatmap
plt.figure(figsize=(12, 8))  # You can adjust the size as needed
sns.heatmap(pivot_table, annot=True, cmap='YlGnBu', fmt='g', cbar_kws={'label': 'Average NINQ'})
plt.title('Loan Default by Reason and Job')
plt.show()

Observation

  • The average NINQ (number of recent credit inquiries) differs markedly between the two classes: clients with a higher NINQ are more likely to default
In [53]:
#lets create a pivot table to show the summary of the data
pivot_table = pd.pivot_table(data, 'CLNO', index=['REASON', 'JOB'], columns='BAD').round(2)
pivot_table
Out[53]:
BAD 0 1
REASON JOB
DebtCon Mgr 23.42 23.53
Office 23.01 20.49
Other 19.40 21.03
ProfExe 25.33 27.40
Sales 23.02 28.06
Self 19.71 26.88
HomeImp Mgr 19.83 20.90
Office 18.91 18.30
Other 19.47 14.22
ProfExe 22.60 20.99
Sales 22.50 32.00
Self 24.00 19.22
In [54]:
import seaborn as sns
import matplotlib.pyplot as plt

# Plotting the heatmap
plt.figure(figsize=(12, 8))  # You can adjust the size as needed
sns.heatmap(pivot_table, annot=True, cmap='YlGnBu', fmt='g', cbar_kws={'label': 'Average CLNO'})
plt.title('Loan Default by Reason and Job')
plt.show()

Observation

  • The average CLNO (number of credit lines) shows no significant difference between clients who defaulted and those who repaid.
In [55]:
#lets create a pivot table to show the summary of the data
pivot_table = pd.pivot_table(data, 'DEBTINC', index=['REASON', 'JOB'], columns='BAD').round(2)
pivot_table
Out[55]:
BAD 0 1
REASON JOB
DebtCon Mgr 35.13 38.59
Office 34.20 39.56
Other 34.19 37.60
ProfExe 32.14 44.36
Sales 33.46 51.80
Self 33.84 45.73
HomeImp Mgr 34.07 39.19
Office 33.41 35.83
Other 33.34 37.02
ProfExe 31.84 31.95
Sales 41.25 25.65
Self 32.34 61.38
In [56]:
import seaborn as sns
import matplotlib.pyplot as plt

# Plotting the heatmap
plt.figure(figsize=(12, 8))  # You can adjust the size as needed
sns.heatmap(pivot_table, annot=True, cmap='YlGnBu', fmt='g', cbar_kws={'label': 'Average DEBTINC'})
plt.title('Loan Default by Reason and Job')
plt.show()

Observation

  • The average DEBTINC (debt-to-income ratio) differs markedly between the two classes: clients with a higher DEBTINC are more likely to default

General Observations

  • From the heatmaps above, clients with higher DEROG, DELINQ, NINQ, and DEBTINC are at higher risk of defaulting, regardless of their job or reason for taking the loan.
  • Clients with a lower CLAGE, i.e. the shortest credit history, are more likely to default on loan payments.
  • VALUE, MORTDUE, and YOJ have little or no apparent impact on default or repayment.
In [58]:
#check for correlation among the numerical variables
data.corr()
Out[58]:
BAD LOAN MORTDUE VALUE YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
BAD 1.000000 -0.075099 -0.048219 -0.029954 -0.060238 0.276081 0.354107 -0.170499 0.174980 -0.004157 0.199835
LOAN -0.075099 1.000000 0.228595 0.335393 0.105728 -0.001302 -0.035144 0.088540 0.044487 0.072631 0.084735
MORTDUE -0.048219 0.228595 1.000000 0.875666 -0.088480 -0.049516 -0.001044 0.140047 0.031392 0.323631 0.154939
VALUE -0.029954 0.335393 0.875666 1.000000 0.007759 -0.048689 -0.014103 0.171238 -0.004398 0.269149 0.132174
YOJ -0.060238 0.105728 -0.088480 0.007759 1.000000 -0.065879 0.038153 0.201718 -0.071627 0.024838 -0.055891
DEROG 0.276081 -0.001302 -0.049516 -0.048689 -0.065879 1.000000 0.211832 -0.083047 0.173934 0.061884 0.017065
DELINQ 0.354107 -0.035144 -0.001044 -0.014103 0.038153 0.211832 1.000000 0.022488 0.067812 0.164639 0.052364
CLAGE -0.170499 0.088540 0.140047 0.171238 0.201718 -0.083047 0.022488 1.000000 -0.116935 0.237987 -0.046477
NINQ 0.174980 0.044487 0.031392 -0.004398 -0.071627 0.173934 0.067812 -0.116935 1.000000 0.088389 0.141344
CLNO -0.004157 0.072631 0.323631 0.269149 0.024838 0.061884 0.164639 0.237987 0.088389 1.000000 0.185539
DEBTINC 0.199835 0.084735 0.154939 0.132174 -0.055891 0.017065 0.052364 -0.046477 0.141344 0.185539 1.000000
In [59]:
# Finding the correlation between various columns of the dataset
plt.figure(figsize = (10,7))
sns.heatmap(data.corr(), annot = True, vmin = -1, vmax = 1, fmt = ".2f", cmap = "Spectral")
Out[59]:
<Axes: >

There is a strong correlation between VALUE and MORTDUE

  • Models such as logistic regression prefer the independent variables (predictors) not to be highly correlated with each other. Correlated predictors introduce multicollinearity, which inflates coefficient variances and makes the model's estimates unstable.
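Beyond eyeballing the heatmap, the multicollinearity noted above can be quantified with Variance Inflation Factors. A standard identity is that the VIFs are the diagonal entries of the inverse of the predictors' correlation matrix; the sketch below uses that identity on toy data (the helper name `vif_table` and the demo frame are illustrative, not part of this notebook):

```python
import numpy as np
import pandas as pd

def vif_table(df, cols):
    """Variance Inflation Factors via the inverse correlation matrix.
    VIF_j is the j-th diagonal entry of R^-1; values above ~5-10 flag
    problematic multicollinearity."""
    corr = df[cols].dropna().corr().values
    return pd.Series(np.diag(np.linalg.inv(corr)), index=cols).round(2)

# toy demonstration: x2 is nearly a copy of x1, so both get a large VIF
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
demo = pd.DataFrame({"x1": x1,
                     "x2": x1 + rng.normal(scale=0.1, size=200),
                     "x3": rng.normal(size=200)})
v = vif_table(demo, ["x1", "x2", "x3"])
print(v)
```

On this dataset, `vif_table(data, num_cols)` would make explicit which predictors (e.g. VALUE and MORTDUE, correlated at 0.88) inflate each other's variance.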

Treating Outliers¶

  • It's often a good idea to handle outliers before missing values, especially if statistical methods are used for imputation, as outliers can skew these statistics.
In [60]:
num_cols
Out[60]:
['BAD',
 'LOAN',
 'MORTDUE',
 'VALUE',
 'YOJ',
 'DEROG',
 'DELINQ',
 'CLAGE',
 'NINQ',
 'CLNO',
 'DEBTINC']
In [61]:
# outlier detection using boxplot
# selecting the numerical columns of data and adding their names in a list
num_cols = ['LOAN','MORTDUE','VALUE','YOJ','DEROG','DELINQ','CLAGE','NINQ','CLNO','DEBTINC']

plt.figure(figsize=(15, 12))

for i, variable in enumerate(num_cols):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()
In [62]:
# to find the 25th percentile and 75th percentile for the numerical columns.
Q1 = data[num_cols].quantile(0.25)
Q3 = data[num_cols].quantile(0.75)

IQR = Q3 - Q1                   #Inter Quartile Range (75th percentile - 25th percentile)

lower_whisker = Q1 - 1.5*IQR    #Finding lower and upper bounds for all values. All values outside these bounds are outliers
upper_whisker = Q3 + 1.5*IQR
In [63]:
# Percentage of outliers in each column
((data[num_cols] < lower_whisker) | (data[num_cols] > upper_whisker)).sum()/data.shape[0]*100
Out[63]:
LOAN        4.295302
MORTDUE     3.926174
VALUE       5.369128
YOJ         1.526846
DEROG      12.164430
DELINQ     20.151007
CLAGE       0.788591
NINQ        2.969799
CLNO        3.674497
DEBTINC     1.577181
dtype: float64

Observation

  • DELINQ and DEROG have the highest percentage of outliers

Treating outliers

We will cap/clip the minimum and maximum value of these columns to the lower and upper whisker value of the boxplot found using Q1 - 1.5*IQR and Q3 + 1.5*IQR, respectively.

Note: Generally, a value of 1.5 * IQR is taken to cap the values of outliers to upper and lower whiskers but any number (example 0.5, 2, 3, etc) other than 1.5 can be chosen. The value depends upon the business problem statement.

Creating a function to floor and cap/clip outliers in a column

In [64]:
def treat_outliers(df, col):
    """
    Caps outliers in a numerical column to the IQR whiskers.
    df: dataframe
    col: str, name of the numerical column
    """
    Q1 = df[col].quantile(0.25)  # 25th percentile
    Q3 = df[col].quantile(0.75)  # 75th percentile
    IQR = Q3 - Q1                # Inter Quartile Range (75th percentile - 25th percentile)
    lower_whisker = Q1 - 1.5 * IQR
    upper_whisker = Q3 + 1.5 * IQR

    # all values smaller than lower_whisker are set to lower_whisker and
    # all values greater than upper_whisker are set to upper_whisker,
    # using NumPy's clip function
    df[col] = np.clip(df[col], lower_whisker, upper_whisker)

    return df

Treating outliers in the numerical columns

In [65]:
data = treat_outliers(data,'LOAN')

# visualizing the column after outlier treatment
sns.boxplot(data=data,x='LOAN')
plt.show()
In [66]:
data = treat_outliers(data,'MORTDUE')

# visualizing the column after outlier treatment
sns.boxplot(data=data,x='MORTDUE')
plt.show()
In [67]:
data = treat_outliers(data,'VALUE')

# visualizing the column after outlier treatment
sns.boxplot(data=data,x='VALUE')
plt.show()
In [68]:
data = treat_outliers(data,'YOJ')

# visualizing the column after outlier treatment
sns.boxplot(data=data,x='YOJ')
plt.show()
In [69]:
data = treat_outliers(data,'DEROG')

# visualizing the column after outlier treatment
sns.boxplot(data=data,x='DEROG')
plt.show()
In [70]:
data = treat_outliers(data,'DELINQ')

# visualizing the column after outlier treatment
sns.boxplot(data=data,x='DELINQ')
plt.show()
In [71]:
data = treat_outliers(data,'CLAGE')

# visualizing the column after outlier treatment
sns.boxplot(data=data,x='CLAGE')
plt.show()
In [72]:
data = treat_outliers(data,'NINQ')

# visualizing the column after outlier treatment
sns.boxplot(data=data,x='NINQ')
plt.show()
In [73]:
data = treat_outliers(data,'CLNO')

# visualizing the column after outlier treatment
sns.boxplot(data=data,x='CLNO')
plt.show()
In [74]:
data = treat_outliers(data,'DEBTINC')

# visualizing the column after outlier treatment
sns.boxplot(data=data,x='DEBTINC')
plt.show()

Observation

  • Outliers have been treated.
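The ten near-identical cells above can be condensed into a loop over num_cols. A minimal sketch, redefining the same IQR capping so it is self-contained and using a toy frame in place of the notebook's data:

```python
import numpy as np
import pandas as pd

def treat_outliers(df, col):
    # same IQR capping as the function defined earlier in the notebook
    Q1, Q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    IQR = Q3 - Q1
    df[col] = np.clip(df[col], Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)
    return df

num_cols = ['LOAN', 'MORTDUE']  # in the notebook this is the full numeric list
demo = pd.DataFrame({'LOAN': [1, 2, 3, 100],
                     'MORTDUE': [10.0, 11.0, 12.0, -500.0]})
for col in num_cols:            # one call per column instead of one cell per column
    demo = treat_outliers(demo, col)
print(demo)
```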

Treating Missing Values¶

In [81]:
data.head()
Out[81]:
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
0 1 1100 25860.0 39025.0 HomeImp Other 10.5 0.0 0.0 94.366667 1.0 9.0 NaN
1 1 1300 70053.0 68400.0 HomeImp Other 7.0 0.0 0.0 121.833333 0.0 14.0 NaN
2 1 1500 13500.0 16700.0 HomeImp Other 4.0 0.0 0.0 149.466667 1.0 10.0 NaN
3 1 1500 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 0 1700 97800.0 112000.0 HomeImp Office 3.0 0.0 0.0 93.333333 0.0 14.0 NaN
In [82]:
data.isnull().sum()
Out[82]:
BAD           0
LOAN          0
MORTDUE     518
VALUE       112
REASON      252
JOB         279
YOJ         515
DEROG       708
DELINQ      580
CLAGE       308
NINQ        510
CLNO        222
DEBTINC    1267
dtype: int64
In [83]:
#percentage of missing values in the dataframe
round(data.isnull().sum()[data.isnull().sum()>0] / len(data)*100,2).sort_values(ascending=True)
Out[83]:
VALUE       1.88
CLNO        3.72
REASON      4.23
JOB         4.68
CLAGE       5.17
NINQ        8.56
YOJ         8.64
MORTDUE     8.69
DELINQ      9.73
DEROG      11.88
DEBTINC    21.26
dtype: float64
In [84]:
# Calculate the percentage of missing values for each column
missing_percentage = (data.isnull().sum() / len(data)) * 100

# Sort the values in descending order for better visualization
missing_percentage_sorted = missing_percentage.sort_values(ascending=False)

# Plot the bar graph
plt.figure(figsize=(12, 8))
sns.barplot(x=missing_percentage_sorted.index, y=missing_percentage_sorted.values)
plt.xticks(rotation=45)
plt.ylabel('Percentage (%)')
plt.title('Percentage of Missing Values by Column')
plt.show()

Observation

  • Out of the 13 columns, only two, BAD and LOAN, have no missing values
  • All other columns have missing values
  • DEBTINC has the most missing values, at ~21%

Filling missing values for categorial values

In [85]:
# extracting all the information of other variables where REASON is null
data.loc[data['REASON'].isnull()==True]
Out[85]:
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
3 1 1500 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
10 1 2000 22608.0 NaN NaN NaN 18.0 NaN NaN NaN NaN NaN NaN
13 0 2000 64536.0 87400.000 NaN Mgr 2.5 0.0 0.0 147.133333 0.0 24.0 NaN
17 1 2200 23030.0 NaN NaN NaN 19.0 NaN NaN NaN NaN NaN 14.345367
51 0 3100 NaN 70400.000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ...
5435 1 32000 92400.0 200447.375 NaN Self 2.5 0.0 0.0 165.333333 1.0 40.0 NaN
5468 1 32900 NaN 55657.000 NaN NaN 1.0 0.0 0.0 210.012265 5.0 15.0 14.345367
5803 1 41600 159306.0 NaN NaN Other 0.0 0.0 0.0 160.333333 2.0 25.0 NaN
5826 1 41600 159306.0 NaN NaN Other 0.0 0.0 0.0 142.343205 2.0 25.0 14.345367
5900 1 41600 46126.0 83800.000 NaN Other 28.0 0.0 0.0 339.900000 1.0 23.0 NaN

252 rows × 13 columns

In [86]:
#count the number of unique values
data['REASON'].value_counts()
Out[86]:
DebtCon    3928
HomeImp    1780
Name: REASON, dtype: int64

DebtCon has the most entries and is therefore the mode.

In [87]:
#imputing the missing values with the mode
value_to_fill = 'DebtCon'  # Change this to your specific value
data['REASON'].fillna(value_to_fill, inplace=True)
In [88]:
#count the number of unique values
data['JOB'].value_counts()
Out[88]:
Other      2388
ProfExe    1276
Office      948
Mgr         767
Self        193
Sales       109
Name: JOB, dtype: int64

Other has the most entries and is therefore the mode.

In [89]:
#imputing the missing values with the mode
value_to_fill = 'Other'  # Change this to your specific value
data['JOB'].fillna(value_to_fill, inplace=True)
In [90]:
# extracting all the information of other variables where REASON is null
data.loc[data['REASON'].isnull()==True]
Out[90]:
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
In [91]:
# extracting all the information of other variables where JOB is null
data.loc[data['JOB'].isnull()==True]
Out[91]:
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC

Observation: the missing values for REASON and JOB have been imputed.

In [92]:
# extracting all the information of other variables where CLNO is null
data.loc[data['CLNO'].isnull()==True]
Out[92]:
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
3 1 1500 NaN NaN DebtCon Other NaN NaN NaN NaN NaN NaN NaN
10 1 2000 22608.0 NaN DebtCon Other 18.0 NaN NaN NaN NaN NaN NaN
17 1 2200 23030.0 NaN DebtCon Other 19.0 NaN NaN NaN NaN NaN 14.345367
51 0 3100 NaN 70400.0 DebtCon Other NaN NaN NaN NaN NaN NaN NaN
63 1 3600 61584.0 61800.0 HomeImp ProfExe 10.0 NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ...
4680 0 24600 NaN 146804.0 DebtCon Other NaN NaN NaN NaN NaN NaN 17.263535
4789 0 25100 85337.0 104607.0 HomeImp Other 6.0 NaN NaN NaN NaN NaN 27.950475
4880 0 25600 NaN 147598.0 DebtCon Other NaN NaN NaN NaN NaN NaN 14.461987
4899 0 25700 85417.0 98179.0 HomeImp Other 7.0 NaN NaN NaN NaN NaN 30.829477
4947 0 26100 NaN 151429.0 DebtCon Other NaN NaN NaN NaN NaN NaN 15.567001

222 rows × 13 columns

It looks like wherever CLNO is null, the data points in other columns are also missing. Let's check this.

In [93]:
data.loc[data['CLNO'].isnull()==True,'NINQ']
Out[93]:
3      NaN
10     NaN
17     NaN
51     NaN
63     NaN
        ..
4680   NaN
4789   NaN
4880   NaN
4899   NaN
4947   NaN
Name: NINQ, Length: 222, dtype: float64
In [94]:
data.loc[data['CLNO'].isnull()==True,'NINQ'].value_counts(dropna=False)
Out[94]:
NaN    222
Name: NINQ, dtype: int64
In [95]:
data.loc[data['CLNO'].isnull()==True,'CLAGE'].value_counts(dropna=False)
Out[95]:
NaN    222
Name: CLAGE, dtype: int64
In [96]:
data.loc[data['CLNO'].isnull()==True,'DELINQ'].value_counts(dropna=False)
Out[96]:
NaN    222
Name: DELINQ, dtype: int64
In [97]:
data.loc[data['CLNO'].isnull()==True,'DEROG'].value_counts(dropna=False)
Out[97]:
NaN    222
Name: DEROG, dtype: int64
  • There is a strong pattern in the missing values: wherever CLNO is missing, the NINQ, CLAGE, DEROG, and DELINQ columns are missing as well.
  • The count is consistent: each of these 5 columns has exactly 222 missing rows in this subset.
  • Let's see if the missing data has some pattern in VALUE, YOJ and MORTDUE.
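The per-column value_counts checks above can also be collapsed into a single co-missingness test. A minimal sketch on a toy frame (column names mirror the notebook's; the values are made up):

```python
import numpy as np
import pandas as pd

# toy frame: rows 0-1 have the whole credit-bureau block missing together
demo = pd.DataFrame({
    'CLNO':   [np.nan, np.nan, 10, 12],
    'NINQ':   [np.nan, np.nan, 1, 0],
    'DELINQ': [np.nan, np.nan, 0, 2],
})
block = ['NINQ', 'DELINQ']
# among rows where CLNO is missing, count rows where the whole block is missing too
mask = demo['CLNO'].isnull()
co_missing = demo.loc[mask, block].isnull().all(axis=1).sum()
print(co_missing)  # both CLNO-missing rows also miss NINQ and DELINQ
```

Run on the real data with `block = ['NINQ', 'CLAGE', 'DEROG', 'DELINQ']`, this would confirm the 222-row pattern in one step.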
In [98]:
data.loc[data['CLNO'].isnull()==True,'YOJ'].value_counts(dropna=False)
Out[98]:
NaN      117
0.00      14
8.00      14
7.00       8
6.00       8
3.00       8
9.00       7
10.00      7
2.00       5
5.00       5
4.00       5
19.00      4
1.00       4
13.00      3
2.90       1
3.50       1
21.00      1
15.00      1
8.50       1
0.75       1
0.30       1
24.00      1
23.00      1
20.00      1
12.00      1
18.00      1
0.80       1
Name: YOJ, dtype: int64
In [99]:
data.loc[data['CLNO'].isnull()==True,'MORTDUE'].value_counts(dropna=False)
Out[99]:
NaN         81
31000.0      2
159306.0     2
83830.0      1
60667.0      1
            ..
77916.0      1
84666.0      1
80490.0      1
82972.0      1
85417.0      1
Name: MORTDUE, Length: 140, dtype: int64
In [100]:
data.loc[data['CLNO'].isnull()==True,'VALUE'].value_counts(dropna=False)
Out[100]:
NaN           18
200447.375     2
80037.000      1
76965.000      1
123339.000     1
              ..
123613.000     1
93589.000      1
125912.000     1
32365.000      1
151429.000     1
Name: VALUE, Length: 204, dtype: int64

Observation

  • There is no corresponding pattern of missingness in the MORTDUE, YOJ, and VALUE columns.

Missing value treatment for CLNO, NINQ, CLAGE, DEROG and DELINQ columns

  • One approach to treating the missing values in these columns is to group the data by REASON and JOB, to get a better estimate of the typical CLNO, NINQ, CLAGE, DEROG, and DELINQ for each group.
  • Clients with the same job type and loan reason are likely to obtain similar loan amounts and have similar CLNO, CLAGE, NINQ, DEROG, and DELINQ values.
In [101]:
# checking the counts of CLNO, NINQ, CLAGE, DEROG and DELINQ
data.groupby(['REASON','JOB'])[['CLNO','NINQ','CLAGE','DEROG','DELINQ']].value_counts()
Out[101]:
REASON   JOB   CLNO  NINQ  CLAGE       DEROG  DELINQ
DebtCon  Mgr   5.0   0.0   91.431005   0.0    0.0       1
               22.0  0.0   102.972079  0.0    0.0       1
                           189.768060  0.0    0.0       1
                           188.140417  0.0    0.0       1
                           187.713486  0.0    0.0       1
                                                       ..
HomeImp  Self  21.0  1.0   178.578482  0.0    0.0       1
                           173.214284  0.0    0.0       1
                           169.343724  0.0    0.0       1
                     0.0   190.229251  0.0    0.0       1
               42.5  3.0   255.415514  0.0    0.0       1
Length: 5017, dtype: int64
In [102]:
# checking the average number of CLNO, NINQ, CLAGE, DEROG and DELINQ
data.groupby(['REASON','JOB'])[['CLNO','NINQ','CLAGE','DEROG','DELINQ']].mean()
Out[102]:
CLNO NINQ CLAGE DEROG DELINQ
REASON JOB
DebtCon Mgr 23.354811 1.488246 175.284110 0.0 0.0
Office 21.961897 1.111290 175.058635 0.0 0.0
Other 19.377332 1.275547 167.132922 0.0 0.0
ProfExe 25.265258 1.049043 195.232533 0.0 0.0
Sales 24.561856 0.697917 200.605195 0.0 0.0
Self 22.089744 1.205128 171.511805 0.0 0.0
HomeImp Mgr 20.073529 0.768750 171.767755 0.0 0.0
Office 18.377517 0.451957 189.004531 0.0 0.0
Other 17.311724 0.906475 181.177541 0.0 0.0
ProfExe 22.068750 0.778351 191.634412 0.0 0.0
Sales 25.625000 0.500000 151.220279 0.0 0.0
Self 22.209821 1.425287 176.661493 0.0 0.0

Observation

  • We now have the mean CLNO, NINQ, CLAGE, DEROG and DELINQ for each combination of REASON and JOB.
  • We will use the fillna() function together with the transform() method of pandas to impute the missing values.

  • fillna() function - fills NaN values using the provided input value.

     Syntax of fillna():  data['column'].fillna(value = x)

  • transform() function - applies a specified function to each group and returns a result aligned with the original index, which makes it convenient for group-wise imputation.

    Syntax of transform(): data.groupby(keys)['column'].transform(func)

    • func - the function (or function name, e.g. 'mean') to apply to each group.
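A minimal illustration of the fillna() + groupby-transform pattern before applying it to the real columns (the toy values are made up):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'REASON': ['DebtCon', 'DebtCon', 'HomeImp', 'HomeImp'],
    'CLNO':   [20.0, np.nan, 10.0, 14.0],
})
# transform('mean') broadcasts each group's mean back onto the original rows,
# so fillna replaces every NaN with the mean of its own REASON group
group_mean = demo.groupby('REASON')['CLNO'].transform('mean')
demo['CLNO'] = demo['CLNO'].fillna(group_mean)
print(demo['CLNO'].tolist())  # [20.0, 20.0, 10.0, 14.0]
```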
In [103]:
# imputing missing values in CLNO column
data['CLNO'] = data['CLNO'].fillna(value = data.groupby(['REASON','JOB'])['CLNO'].transform('mean'))
In [104]:
# imputing missing values in NINQ column
data['NINQ'] = data['NINQ'].fillna(value = data.groupby(['REASON','JOB'])['NINQ'].transform('mean'))
In [105]:
# imputing missing values in CLAGE column
data['CLAGE'] = data['CLAGE'].fillna(value = data.groupby(['REASON','JOB'])['CLAGE'].transform('mean'))
In [106]:
# imputing missing values in DEROG column
data['DEROG'] = data['DEROG'].fillna(value = data.groupby(['REASON','JOB'])['DEROG'].transform('mean'))
In [107]:
# imputing missing values in DELINQcolumn
data['DELINQ'] = data['DELINQ'].fillna(value = data.groupby(['REASON','JOB'])['DELINQ'].transform('mean'))
In [108]:
# checking if all the missing values were imputed in CLNO, NINQ, CLAGE, DEROG, and DELINQ
pd.DataFrame({'Count':data.isnull().sum()[data.isnull().sum()>0],'Percentage':(data.isnull().sum()[data.isnull().sum()>0]/data.shape[0])*100})
Out[108]:
Count Percentage
MORTDUE 518 8.691275
VALUE 112 1.879195
YOJ 515 8.640940
DEBTINC 1267 21.258389
  • We see that the missing values have been imputed.
  • Let's convert all the values of CLNO, NINQ, CLAGE, DEROG and DELINQ to integer type as these columns have discrete values.
In [109]:
#change the datatypes from float to int
data['CLAGE'] = data['CLAGE'].round().astype(int)
data['DEROG'] = data['DEROG'].round().astype(int)
data['DELINQ'] = data['DELINQ'].round().astype(int)
data['NINQ'] = data['NINQ'].round().astype(int)
data['CLNO'] = data['CLNO'].round().astype(int)
In [110]:
data.dtypes
Out[110]:
BAD          int64
LOAN         int64
MORTDUE    float64
VALUE      float64
REASON      object
JOB         object
YOJ        float64
DEROG        int64
DELINQ       int64
CLAGE        int64
NINQ         int64
CLNO         int64
DEBTINC    float64
dtype: object
In [111]:
# checking if all the missing values were imputed in CLNO, NINQ, CLAGE, DEROG, and DELINQ
pd.DataFrame({'Count':data.isnull().sum()[data.isnull().sum()>0],'Percentage':(data.isnull().sum()[data.isnull().sum()>0]/data.shape[0])*100})
Out[111]:
Count Percentage
MORTDUE 518 8.691275
VALUE 112 1.879195
YOJ 515 8.640940
DEBTINC 1267 21.258389
  • We have 4 columns left with missing values
  • MORTDUE and YOJ have a similar number of missing values; let's see if there is any pattern in these two columns
In [112]:
# extracting all the information of other variables where MORTDUE is null
data.loc[data['MORTDUE'].isnull()==True]
Out[112]:
BAD LOAN MORTDUE VALUE REASON JOB YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC
3 1 1500 NaN NaN DebtCon Other NaN 0 0 167 1 19 NaN
9 1 2000 NaN 62250.0 HomeImp Sales 16.0 0 0 116 0 13 NaN
24 1 2400 NaN 17180.0 HomeImp Other NaN 0 0 15 3 4 NaN
40 1 3000 NaN 8800.0 HomeImp Other 2.0 0 0 78 0 3 NaN
41 1 3000 NaN 33000.0 HomeImp Other 1.0 0 0 23 1 2 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ...
5880 0 41600 NaN 84205.0 HomeImp Other NaN 0 0 340 0 7 22.639940
5883 0 41600 NaN 81322.0 HomeImp Self 9.0 0 0 171 0 22 24.709060
5884 0 41600 NaN 91309.0 HomeImp Other NaN 0 0 350 0 6 22.061330
5930 1 41600 NaN 85000.0 DebtCon Other 1.0 0 0 117 5 23 NaN
5931 1 41600 NaN 89609.0 DebtCon Other 1.0 0 0 116 5 23 41.105569

518 rows × 13 columns

In [113]:
# Plotting the distribution of "MORTDUE"
sns.histplot(data['MORTDUE'], bins=30, kde=True)
plt.show()

# Checking the summary statistics
print(data['MORTDUE'].describe())
count      5442.000000
mean      71566.093752
std       37203.654400
min        2063.000000
25%       46276.000000
50%       65019.000000
75%       91488.000000
max      159306.000000
Name: MORTDUE, dtype: float64
In [114]:
sns.displot(data=data,x='MORTDUE',kind='kde')
plt.show()
  • The MORTDUE column is approximately normally distributed with slight right skewness, so imputing with the mean (computed per REASON/JOB group) is a reasonable method
In [115]:
# checking the average number of MORTDUE
data.groupby(['REASON','JOB'])[['MORTDUE']].mean()
Out[115]:
MORTDUE
REASON JOB
DebtCon Mgr 81412.883162
Office 67915.683200
Other 61002.888087
ProfExe 89500.977056
Sales 81161.852273
Self 101030.421053
HomeImp Mgr 73780.208025
Office 63644.349537
Other 53643.771285
ProfExe 89176.335821
Sales 77174.818182
Self 83019.442308
In [116]:
# imputing missing values in MORTDUE column
data['MORTDUE'] = data['MORTDUE'].fillna(value = data.groupby(['REASON','JOB'])['MORTDUE'].transform('mean'))
In [117]:
#check the distribution after imputation of missing values
sns.displot(data=data,x='MORTDUE',kind='kde')
plt.show()

Imputation with the mean didn't noticeably affect the overall distribution of the column

In [118]:
# checking if all the missing values were imputed in MORTDUE
pd.DataFrame({'Count':data.isnull().sum()[data.isnull().sum()>0],'Percentage':(data.isnull().sum()[data.isnull().sum()>0]/data.shape[0])*100})
Out[118]:
Count Percentage
VALUE 112 1.879195
YOJ 515 8.640940
DEBTINC 1267 21.258389

Missing value treatment for VALUE

In [119]:
# Plotting the distribution of "VALUE"
sns.histplot(data['VALUE'], bins=30, kde=True)
plt.show()

# Checking the summary statistics
print(data['VALUE'].describe())
count      5848.000000
mean      98538.057633
std       45070.800236
min        8000.000000
25%       66075.500000
50%       89235.500000
75%      119824.250000
max      200447.375000
Name: VALUE, dtype: float64

The VALUE column is approximately normally distributed with slight right skewness, so imputing with the mean is a reasonable method

In [120]:
# checking the average number of VALUE
data.groupby(['REASON','JOB'])[['VALUE']].mean()
Out[120]:
VALUE
REASON JOB
DebtCon Mgr 108783.151033
Office 95515.978626
Other 85908.692319
ProfExe 121511.691065
Sales 106791.007812
Self 135942.748288
HomeImp Mgr 102486.195000
Office 90215.395926
Other 75576.722381
ProfExe 125971.866875
Sales 81114.500000
Self 137957.218478
In [121]:
# imputing missing values in VALUE column
data['VALUE'] = data['VALUE'].fillna(value = data.groupby(['REASON','JOB'])['VALUE'].transform('mean'))
In [122]:
#recheck the distribution after imputation of missing values
sns.displot(data=data,x='VALUE',kind='kde')
plt.show()
In [123]:
# checking if all the missing values were imputed in VALUE
pd.DataFrame({'Count':data.isnull().sum()[data.isnull().sum()>0],'Percentage':(data.isnull().sum()[data.isnull().sum()>0]/data.shape[0])*100})
Out[123]:
Count Percentage
YOJ 515 8.640940
DEBTINC 1267 21.258389

Missing value treatment for YOJ

In [124]:
# Plotting the distribution of "YOJ"
sns.histplot(data['YOJ'], bins=30, kde=True)
plt.show()

# Checking the summary statistics
print(data['YOJ'].describe())
count    5445.000000
mean        8.873159
std         7.430914
min         0.000000
25%         3.000000
50%         7.000000
75%        13.000000
max        28.000000
Name: YOJ, dtype: float64
In [125]:
sns.displot(data=data,x='YOJ',kind='kde')
plt.show()
  • As the YOJ column is skewed, imputing with the mean may not be appropriate, since the mean is pulled by outliers. We will instead use the median, which is robust to outliers, to impute the missing values.
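A quick numeric illustration of why the median is preferred over the mean under skew:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 100])  # one large value, like the right tail of YOJ
print(s.mean(), s.median())       # mean is dragged to 22.0; median stays at 3.0
```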
In [126]:
# checking the median YOJ for each REASON/JOB group
data.groupby(['REASON','JOB'])[['YOJ']].median()
Out[126]:
YOJ
REASON JOB
DebtCon Mgr 8.0
Office 5.0
Other 7.0
ProfExe 8.0
Sales 4.0
Self 5.0
HomeImp Mgr 8.0
Office 10.0
Other 7.0
ProfExe 7.0
Sales 4.0
Self 7.0
In [127]:
# imputing missing values in YOJ column
data['YOJ'] = data['YOJ'].fillna(value = data.groupby(['REASON','JOB'])['YOJ'].transform('median'))
In [128]:
#recheck the distribution after imputation of missing values
sns.displot(data=data,x='YOJ',kind='kde')
plt.show()
In [129]:
#change the datatype from float to int
data['YOJ'] = data['YOJ'].round().astype(int)
In [130]:
data.dtypes
Out[130]:
BAD          int64
LOAN         int64
MORTDUE    float64
VALUE      float64
REASON      object
JOB         object
YOJ          int64
DEROG        int64
DELINQ       int64
CLAGE        int64
NINQ         int64
CLNO         int64
DEBTINC    float64
dtype: object
In [131]:
# checking if all the missing values were imputed in VALUE
pd.DataFrame({'Count':data.isnull().sum()[data.isnull().sum()>0],'Percentage':(data.isnull().sum()[data.isnull().sum()>0]/data.shape[0])*100})
Out[131]:
Count Percentage
DEBTINC 1267 21.258389

Missing Values treatment for DEBTINC

In [132]:
# Plotting the distribution of "DEBTINC"
sns.histplot(data['DEBTINC'], bins=30, kde=True)
plt.show()

# Checking the summary statistics
print(data['DEBTINC'].describe())
count    4693.000000
mean       33.681973
std         7.135236
min        14.345367
25%        29.140031
50%        34.818262
75%        39.003141
max        53.797805
Name: DEBTINC, dtype: float64
In [133]:
data['DEBTINC'].value_counts()
Out[133]:
14.345367    59
53.797805    35
37.113614     1
41.824914     1
37.823173     1
             ..
35.512673     1
34.913793     1
41.077118     1
29.330142     1
34.571519     1
Name: DEBTINC, Length: 4601, dtype: int64
In [134]:
sns.displot(data=data,x='DEBTINC',kind='kde')
plt.show()

DEBTINC appears approximately normally distributed, hence the mean can be used for imputation of missing values

In [135]:
# checking the average number of DEBTINC
data.groupby(['REASON','JOB'])[['DEBTINC']].mean()
Out[135]:
DEBTINC
REASON JOB
DebtCon Mgr 35.442766
Office 34.685394
Other 33.616029
ProfExe 32.581461
Sales 35.133052
Self 36.278200
HomeImp Mgr 34.639714
Office 33.537149
Other 33.352517
ProfExe 31.879843
Sales 39.301879
Self 32.907360
In [136]:
# imputing missing values in DEBTINC column
data['DEBTINC'] = data['DEBTINC'].fillna(value = data.groupby(['REASON','JOB'])['DEBTINC'].transform('mean'))
In [137]:
sns.displot(data=data,x='DEBTINC',kind='kde')
plt.show()
In [138]:
data.isnull().sum()
Out[138]:
BAD        0
LOAN       0
MORTDUE    0
VALUE      0
REASON     0
JOB        0
YOJ        0
DEROG      0
DELINQ     0
CLAGE      0
NINQ       0
CLNO       0
DEBTINC    0
dtype: int64

All missing values have been treated and none remain in the data, so we can proceed to build the prediction model.

Important Insights from EDA¶

What are the most important observations and insights from the data based on the EDA performed?

From the EDA performed, we observed that

  1. The majority of clients requesting a loan fall under the 'Other' job category; it is possible that many of them were unemployed or lacked stable employment, which could explain why most defaults came from this category.
  2. The main reason for obtaining a loan was debt consolidation (DebtCon), and this group also had the highest number of defaults, consistent with the point above. There was also no significant difference in the outstanding mortgage amount between clients who repaid and those who defaulted.
  3. DEROG, DELINQ, NINQ, and DEBTINC had a significant impact on default: the higher these values, the more likely an applicant is to default.
  4. Conversely, the lower the CLAGE (age of the oldest credit line in months), the more likely an applicant is to default.
In [ ]:
# saving the dataset with all the missing values treated
data.to_csv('/content/drive/MyDrive/Data science MIT/Capstone project/hmeq2.csv',index=False)

Model Building - Approach¶

  • Data preparation
  • Partition the data into train and test set
  • Build the model
  • Fit on the train data
  • Tune the model
  • Test the model on test set

Feature engineering¶

Creating dummy variables for the categorical variables

In [139]:
#check if there is balance distribution in the target variable
data['BAD'].value_counts(normalize=True)
Out[139]:
0    0.800503
1    0.199497
Name: BAD, dtype: float64

There is an imbalanced distribution between class 0 and class 1: ~80% of the applicants/clients did not default on their loan repayment, while ~20% defaulted

In [140]:
#check the categorical column
cat_cols
Out[140]:
['REASON', 'JOB']
In [141]:
# Creating a list of columns for which we will create dummy variables
to_get_dummies_for = ['REASON', 'JOB']

# Creating dummy variables
df = pd.get_dummies(data = data, columns = to_get_dummies_for, drop_first = True)
In [142]:
df.head()
Out[142]:
BAD LOAN MORTDUE VALUE YOJ DEROG DELINQ CLAGE NINQ CLNO DEBTINC REASON_HomeImp JOB_Office JOB_Other JOB_ProfExe JOB_Sales JOB_Self
0 1 1100 25860.000000 39025.000000 10 0 0 94 1 9 33.352517 1 0 1 0 0 0
1 1 1300 70053.000000 68400.000000 7 0 0 122 0 14 33.352517 1 0 1 0 0 0
2 1 1500 13500.000000 16700.000000 4 0 0 149 1 10 33.352517 1 0 1 0 0 0
3 1 1500 61002.888087 85908.692319 7 0 0 167 1 19 33.616029 0 0 1 0 0 0
4 0 1700 97800.000000 112000.000000 3 0 0 93 0 14 33.537149 1 1 0 0 0 0

Preparing data for modeling Any other preprocessing steps (if needed)

Separating the independent variables (X) and the dependent variable (Y)

In [143]:
# Separating independent variables and the target variable
x = df.drop('BAD',axis=1)

y = df['BAD']

Scaling the data

  • The independent variables in this dataset have different scales. When features have different scales from each other, there is a chance that a higher weightage will be given to features that have a higher magnitude, and they will dominate over other features whose magnitude changes may be smaller but whose percentage changes may be just as significant or even larger. This will impact the performance of our machine learning algorithm, and we do not want our algorithm to be biased towards one feature.

  • The solution to this issue is Feature Scaling, i.e. scaling the dataset so as to give every transformed variable a comparable scale.

  • In this problem, we will use the Standard Scaler method, which centers and scales the dataset using the Z-Score.

  • It standardizes features by subtracting the mean and scaling it to have unit variance.

  • The standard score of sample x is calculated as:

    z = (x - u) / s

  • where u is the mean of the training samples and s is the standard deviation of the training samples.
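As a quick sanity check, the z-score formula above can be verified against StandardScaler on a toy array (illustrative values, not the loan data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-column feature with an arbitrary scale
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

sc = StandardScaler()
x_scaled = sc.fit_transform(x)

# Manual z-score: subtract the mean, divide by the (population) standard deviation
z_manual = (x - x.mean()) / x.std()

print(np.allclose(x_scaled, z_manual))  # True
```

StandardScaler uses the biased (population) standard deviation, which is why `x.std()` with its default `ddof=0` matches exactly.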

In [144]:
# Scaling the data
sc = StandardScaler()

x_scaled = sc.fit_transform(x)

x_scaled = pd.DataFrame(x_scaled, columns = x.columns)

Splitting the data into 70% train and 30% test set

In [145]:
# Splitting the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 1, stratify = y)
In [146]:
# Checking the shape of the Train and Test sets
print('X Train Shape:', x_train.shape)
print('X Test Shape:', x_test.shape)
print('Y Train Shape:', y_train.shape)
print('Y Test Shape:', y_test.shape)
X Train Shape: (4172, 16)
X Test Shape: (1788, 16)
Y Train Shape: (4172,)
Y Test Shape: (1788,)

Logistic Regression¶

Model evaluation criterion¶

While all the metrics (precision, recall, and accuracy) are useful for evaluating model performance, our major focus here is on improving recall, because misclassifying a potential defaulter as a non-defaulter (a false negative) can result in significant financial loss to the lending institution. A high recall ensures that the majority of actual defaulters are correctly identified, minimizing such losses.¶
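To make the recall-first reasoning concrete, here is a small sketch with made-up labels (1 = defaulter): recall measures the share of actual defaulters we catch, while precision measures how trustworthy our defaulter flags are.

```python
from sklearn.metrics import precision_score, recall_score

# Illustrative labels only, not the real loan data
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]  # two defaulters missed (false negatives)

# Recall on the defaulter class: 2 of 4 actual defaulters caught
print(recall_score(y_true, y_pred, pos_label=1))     # 0.5
# Precision: 2 of 3 flagged applicants truly default
print(precision_score(y_true, y_pred, pos_label=1))  # ~0.667
```

For the bank, the two missed defaulters (false negatives) are the costly errors, which is why recall on class 1 is the metric we try to push up.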

In [147]:
# Creating metric function

def metrics_score(actual, predicted):

    print(classification_report(actual, predicted))

    cm = confusion_matrix(actual, predicted)

    plt.figure(figsize = (8, 5))

    sns.heatmap(cm, annot = True, fmt = '.2f', xticklabels = ['0', '1'], yticklabels = ['0', '1'])
    plt.ylabel('Actual')

    plt.xlabel('Predicted')

    plt.show()
In [148]:
def model_performance_classification(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier

    predictors: independent variables

    target: dependent variable
    """

    # Predicting using the independent variables
    pred = model.predict(predictors)

    recall = recall_score(target, pred,average = 'macro')                 # To compute recall

    precision = precision_score(target, pred, average = 'macro')              # To compute precision

    acc = accuracy_score(target, pred)                                 # To compute accuracy score


    # Creating a dataframe of metrics

    df_perf = pd.DataFrame(
        {
            "Precision":  precision,
            "Recall":  recall,
            "Accuracy": acc,
        },

        index = [0],
    )

    return df_perf
  • Logistic Regression is a supervised learning algorithm, generally used for binary classification problems, i.e., where the dependent variable is categorical and has only two possible values.
In [149]:
# Fitting the logistic regression model
lg = LogisticRegression()

lg.fit(x_train,y_train)
Out[149]:
LogisticRegression()

Checking the model performance

In [150]:
# Checking the performance on the training data
y_pred_train = lg.predict(x_train)

metrics_score(y_train, y_pred_train)
              precision    recall  f1-score   support

           0       0.80      1.00      0.89      3340
           1       0.00      0.00      0.00       832

    accuracy                           0.80      4172
   macro avg       0.40      0.50      0.44      4172
weighted avg       0.64      0.80      0.71      4172

In [151]:
# Checking the performance on the test dataset
y_pred_test = lg.predict(x_test)
metrics_score(y_test, y_pred_test)
              precision    recall  f1-score   support

           0       0.80      1.00      0.89      1431
           1       0.00      0.00      0.00       357

    accuracy                           0.80      1788
   macro avg       0.40      0.50      0.44      1788
weighted avg       0.64      0.80      0.71      1788

In [152]:
lg_test = model_performance_classification(lg,x_test,y_test)
lg_test
Out[152]:
Precision Recall Accuracy
0 0.400168 0.5 0.800336

Observation

  • The model predicts class 0 well but fails to predict class 1 entirely. This behavior is typical of imbalanced datasets, where one class (here, class 0) significantly outnumbers the other.
  • The precision of 0 for class 1 on the test data means the model never correctly flags a defaulter; the apparent 80% accuracy simply reflects the class imbalance.
  • Overall, the model is not good. This could be because the target is imbalanced or because there is multicollinearity among the features.
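One way to counter the imbalance is class weighting. As a sketch of what `class_weight='balanced'` does under the hood, sklearn's `compute_class_weight` applies the rule `n_samples / (n_classes * bincount(y))`; the 80/20 split here is illustrative, mirroring the BAD distribution:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative target with the same ~80/20 split as BAD
y = np.array([0] * 80 + [1] * 20)

# sklearn's rule: n_samples / (n_classes * bincount(y))
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y)
print(weights)  # [0.625 2.5]: the minority class is upweighted 4x
```

The minority class receives a weight four times larger than the majority class, which pushes the classifier to pay more attention to defaulters during training.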
In [153]:
# Fitting the logistic regression model with balanced class weights
lg_2 = LogisticRegression(class_weight='balanced')

lg_2.fit(x_train,y_train)
Out[153]:
LogisticRegression(class_weight='balanced')
In [154]:
# Checking the performance on the training data
y_pred_train = lg_2.predict(x_train)

metrics_score(y_train, y_pred_train)
              precision    recall  f1-score   support

           0       0.88      0.60      0.72      3340
           1       0.29      0.66      0.41       832

    accuracy                           0.62      4172
   macro avg       0.59      0.63      0.56      4172
weighted avg       0.76      0.62      0.65      4172

In [155]:
# Checking the performance on the test dataset
y_pred_test = lg_2.predict(x_test)
metrics_score(y_test, y_pred_test)
              precision    recall  f1-score   support

           0       0.88      0.58      0.70      1431
           1       0.29      0.68      0.41       357

    accuracy                           0.60      1788
   macro avg       0.58      0.63      0.55      1788
weighted avg       0.76      0.60      0.64      1788

In [156]:
lg_test_2 = model_performance_classification(lg_2,x_test,y_test)
lg_test_2

Observation

  • This model is more balanced in its predictions. While it misclassifies a portion of class 0 instances as class 1 (reflected in the 29% precision for class 1), it has a much higher recall for class 1 (66%) compared to the model which had 0% recall for class 1.
  • The current model, while not perfect, is a significant improvement over the previous one that predicted everything as class 0.

Although we observed some improvement in predicting class 1 on both the train and test sets, there was no improvement in the overall performance of the model.

  • Let's check whether there is multicollinearity in the dataset by plotting a correlation heatmap
In [157]:
# Finding the correlation between various columns of the dataset
plt.figure(figsize = (15,7))
sns.heatmap(df.corr(), annot = True, vmin = -1, vmax = 1, fmt = ".2f", cmap = "Spectral")
Out[157]:
<Axes: >

There is a strong correlation between VALUE and MORTDUE

  • Techniques like L1 (Lasso) and L2 (Ridge) regularization can help combat multicollinearity in logistic regression
  • Let's perform a hyperparameter search to check which regularization works best for the model
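One caveat when searching over penalties: not every solver supports every penalty (for example, 'newton-cg' and 'lbfgs' only support l2, and 'elasticnet' requires 'saga' plus an `l1_ratio`). A sketch of a grid restricted to compatible solver/penalty pairs, which avoids failed fits during the search; the C values are illustrative:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# A list of grids, each limited to solver/penalty pairs sklearn supports:
# 'liblinear' handles l1 and l2; 'newton-cg'/'lbfgs' handle l2 only; 'saga' handles both.
param_grid = [
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1]},
    {'solver': ['newton-cg', 'lbfgs'], 'penalty': ['l2'], 'C': [0.01, 0.1, 1]},
    {'solver': ['saga'], 'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1]},
]

grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring='recall')
```

Passing a list of grids to GridSearchCV makes it try each sub-grid in turn, so incompatible combinations never reach the estimator.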
In [158]:
# Define the hyperparameters and their possible values
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2', 'elasticnet'],
    'solver': ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga'],
    'class_weight': [None, 'balanced']
}

# Create a logistic regression model
lg_3 = LogisticRegression()

# Use GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(lg_3, param_grid, cv=5, scoring='accuracy')  # you can adjust the scoring metric
grid_search.fit(x_train, y_train)

# Print the best parameters
print(grid_search.best_params_)
{'C': 0.01, 'class_weight': None, 'penalty': 'l2', 'solver': 'newton-cg'}
In [159]:
lg_3 = LogisticRegression(C =0.01, class_weight= None, penalty= 'l2', solver= 'newton-cg')
lg_3.fit(x_train,y_train)
Out[159]:
LogisticRegression(C=0.01, solver='newton-cg')
In [160]:
# Checking the performance on the training data
y_pred_train = lg_3.predict(x_train)

metrics_score(y_train, y_pred_train)
              precision    recall  f1-score   support

           0       0.81      0.99      0.89      3340
           1       0.72      0.05      0.10       832

    accuracy                           0.81      4172
   macro avg       0.76      0.52      0.49      4172
weighted avg       0.79      0.81      0.73      4172

In [161]:
# Checking the performance on the test dataset
y_pred_test = lg_3.predict(x_test)
metrics_score(y_test, y_pred_test)
              precision    recall  f1-score   support

           0       0.81      1.00      0.89      1431
           1       0.68      0.04      0.08       357

    accuracy                           0.80      1788
   macro avg       0.74      0.52      0.48      1788
weighted avg       0.78      0.80      0.73      1788

In [162]:
lg_test_3 = model_performance_classification(lg_3,x_test,y_test)
lg_test_3

From this model, precision for class 1 improved on both the train and test sets; however, recall and the F1-score decreased.

  • The model does a good job of identifying class 0 instances but struggles significantly with class 1 instances. The high accuracy might seem promising at first, but the low recall for class 1 and the difference in F1-scores for the two classes indicate a model bias towards class 0.
  • Potential strategies to address this imbalance could include using different algorithms or techniques that handle imbalanced datasets better.
  • For now, the logistic regression model with balanced class weights seems to be the best logistic regression model. Let's check its coefficients
In [163]:
# Printing the coefficients of logistic regression
cols = x.columns

coef_lg = lg_2.coef_

pd.DataFrame(coef_lg,columns = cols).T.sort_values(by = 0, ascending = False)
Out[163]:
0
DEBTINC 0.043204
NINQ 0.014584
CLNO 0.005580
REASON_HomeImp 0.001890
JOB_Other 0.000783
JOB_Self 0.000695
JOB_Sales 0.000503
VALUE 0.000003
DEROG 0.000000
DELINQ 0.000000
MORTDUE -0.000007
LOAN -0.000019
JOB_ProfExe -0.000775
JOB_Office -0.001471
CLAGE -0.005688
YOJ -0.014565

This table displays the coefficients of different features from a logistic regression model. Let's interpret these coefficients:

Positive Coefficients:

  • DEBTINC: The highest positive coefficient, suggesting that as DEBTINC (Debt-to-Income Ratio) increases, the log-odds of the positive class also increase. This could mean that higher debt relative to income might increase the likelihood of the event of interest (e.g., loan default).
  • NINQ: Represents the number of recent credit inquiries. A positive coefficient suggests that as the number of inquiries increases, the likelihood of the positive class increases.
  • CLNO: As the number of credit lines increases, the probability of the positive class increases.
  • REASON_HomeImp, JOB_Other, JOB_Self, JOB_Sales, VALUE: These also have positive coefficients, indicating they increase the log-odds of the positive class, though their effect sizes are smaller compared to the above features.

Zero Coefficients:

  • DEROG, DELINQ: These coefficients are 0, which suggests that during the training of this logistic regression model (possibly with regularization), these features were deemed not to provide any significant information in predicting the target variable.

Negative Coefficients:

  • CLAGE: The most negative coefficient. As the age of the oldest credit line in months increases, the log-odds of the positive class decrease. This suggests that older credit lines might be associated with a lower likelihood of the event of interest.
  • YOJ: Years at the current job. A negative coefficient indicates that as tenure at the current job increases, the probability of the positive class decreases.
  • LOAN, JOB_ProfExe, JOB_Office, MORTDUE: These features also decrease the likelihood of the positive class, with LOAN having the smallest negative effect.

Key Takeaways:

  • Positive coefficients increase the probability of the event of interest, while negative coefficients decrease it. The magnitude of the coefficient provides an idea of the strength of the relationship. For instance, DEBTINC has a much more substantial effect than VALUE.
  • While these coefficients give an idea of the direction and strength of the relationship, they should be interpreted in the context of the logistic function. For instance, a coefficient of 0.043204 for DEBTINC means that for a one-unit increase in DEBTINC, the log-odds of the positive class increase by 0.043204 units.
  • Features with coefficients close to 0 might not be as informative in predicting the target variable, especially if regularization was applied. Remember, the interpretation assumes all other variables are held constant, and it's crucial to understand the domain and context of the data to make meaningful conclusions.
In [164]:
odds = np.exp(lg_2.coef_[0]) # Finding the odds

# Adding the odds to a DataFrame and sorting the values
pd.DataFrame(odds, x_train.columns, columns = ['odds']).sort_values(by = 'odds', ascending = False)
Out[164]:
odds
DEBTINC 1.044151
NINQ 1.014691
CLNO 1.005595
REASON_HomeImp 1.001892
JOB_Other 1.000784
JOB_Self 1.000696
JOB_Sales 1.000503
VALUE 1.000003
DEROG 1.000000
DELINQ 1.000000
MORTDUE 0.999993
LOAN 0.999981
JOB_ProfExe 0.999225
JOB_Office 0.998530
CLAGE 0.994328
YOJ 0.985540

Positive Coefficients

Comparing the coefficients with the odds ratios, we can see the same pattern.

  • Both indicate an increase in the likelihood of the positive class as the predictor increases. For instance, DEBTINC had a positive coefficient, and its odds ratio is 1.044151, which means for each unit increase in DEBTINC, the odds of the positive class increase by 4.4%.

Negative Coefficients

  • Also, both indicate a decrease in the likelihood of the positive class as the predictor increases. For example, YOJ had a negative coefficient, and its odds ratio is 0.985540. This means for each unit increase in YOJ, the odds of the positive class decrease by about 1.5%.

  • However, features like DEROG and DELINQ have coefficients close to 0, and their odds ratios are close to 1, indicating they might not be influential in this model.

The Precision-Recall Curve for Logistic Regression

In [165]:
y_scores_lg = lg_2.predict_proba(x_train) # predict_proba gives the probability of each observation belonging to each class


precisions_lg, recalls_lg, thresholds_lg = precision_recall_curve(y_train, y_scores_lg[:, 1])

# Plot values of precisions, recalls, and thresholds
plt.figure(figsize = (10, 7))

plt.plot(thresholds_lg, precisions_lg[:-1], 'b--', label = 'precision')

plt.plot(thresholds_lg, recalls_lg[:-1], 'g--', label = 'recall')

plt.xlabel('Threshold')

plt.legend(loc = 'upper left')

plt.ylim([0, 1])

plt.show()

From the curve, precision and recall intersect at a threshold of roughly 0.61. Let's find out the performance of the model at this threshold.
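The crossing point need not be read off the plot by eye; a sketch of picking it programmatically on synthetic stand-in data (the 0.61 used for our model was chosen visually):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

# Synthetic stand-in with an ~80/20 class split (illustration only)
X, y = make_classification(n_samples=500, weights=[0.8], random_state=1)
model = LogisticRegression(max_iter=1000).fit(X, y)

scores = model.predict_proba(X)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y, scores)

# Threshold where precision and recall are closest (the crossing point)
idx = np.argmin(np.abs(precisions[:-1] - recalls[:-1]))
print(round(float(thresholds[idx]), 2))
```

In practice the threshold should be chosen on a validation set, and can be moved lower if catching more defaulters (higher recall) matters more than precision.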

In [166]:
optimal_threshold1 = .61

y_pred_train = lg_2.predict_proba(x_train)

metrics_score(y_train, y_pred_train[:, 1] > optimal_threshold1)
              precision    recall  f1-score   support

           0       0.84      0.85      0.84      3340
           1       0.36      0.35      0.36       832

    accuracy                           0.75      4172
   macro avg       0.60      0.60      0.60      4172
weighted avg       0.74      0.75      0.75      4172

In [167]:
optimal_threshold1 = .61

y_pred_test = lg_2.predict_proba(x_test)

metrics_score(y_test, y_pred_test[:, 1] > optimal_threshold1)
              precision    recall  f1-score   support

           0       0.84      0.85      0.85      1431
           1       0.37      0.34      0.35       357

    accuracy                           0.75      1788
   macro avg       0.60      0.60      0.60      1788
weighted avg       0.74      0.75      0.75      1788

Observation

  • The model performs relatively well in predicting the negative class (class 0), with a precision of 84% and recall of 85%.
  • For the positive class (class 1), both precision and recall are considerably lower. This suggests that while the model is fairly good at identifying negative instances, it struggles with positive instances.
  • The adjusted threshold provides some balance between the two classes. While the precision for class 1 isn't high (36%), the model does identify 35% of the actual class 1 instances.
  • This approach and threshold might be beneficial if the goal is to strike a balance between identifying class 0 and class 1 instances. However, the model is still not very good. Let's try another algorithm: the Decision Tree model.

Decision Tree¶

In [168]:
# Building decision tree model
dt = DecisionTreeClassifier(random_state = 1)
In [169]:
# Fitting decision tree model
dt.fit(x_train, y_train)
Out[169]:
DecisionTreeClassifier(random_state=1)

Let's check the model performance of decision tree

In [170]:
# Checking performance on the training dataset
y_train_pred_dt = dt.predict(x_train)

metrics_score(y_train, y_train_pred_dt)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3340
           1       1.00      1.00      1.00       832

    accuracy                           1.00      4172
   macro avg       1.00      1.00      1.00      4172
weighted avg       1.00      1.00      1.00      4172

Observation:

  • The Decision tree is giving a 100% score for all metrics on the training dataset.
  • This model is overfitting: while it performs perfectly on the training data, it is unlikely to generalize as well to the test set or other unseen data.
In [171]:
# Checking performance on the test dataset
y_test_pred_dt = dt.predict(x_test)

metrics_score(y_test, y_test_pred_dt)
              precision    recall  f1-score   support

           0       0.90      0.91      0.90      1431
           1       0.62      0.61      0.61       357

    accuracy                           0.85      1788
   macro avg       0.76      0.76      0.76      1788
weighted avg       0.85      0.85      0.85      1788

In [172]:
dtree_test = model_performance_classification(dt,x_test,y_test)
dtree_test
Out[172]:
Precision Recall Accuracy
0 0.761261 0.758502 0.847315

Observations:

  • The Decision Tree works well on the training data but not as well on the test data: recall for class 1 is 0.61 on the test set versus 1.00 on the training set, i.e., the Decision Tree is overfitting the training data.
  • The precision of 0.62 for class 1 on the test data suggests that there's a 38% (1 - 0.62) chance that an applicant flagged as a defaulter would actually have repaid the loan.
  • For the positive class (class 1), while precision and recall are considerably lower than for class 0, they are still reasonable. An F1-score of 61% suggests the model achieves a balance between precision and recall, but there is room for improvement.
  • Even though the decision tree overfits the training data, it still outperforms the logistic regression model: there was an increase in the overall performance.

Let's plot the feature importance and check the most important features.¶

In [173]:
# Plot the feature importance

importances = dt.feature_importances_

columns = x.columns

importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)

plt.figure(figsize = (13, 13))

sns.barplot(x=importance_df.Importance, y=importance_df.index)
Out[173]:
<Axes: xlabel='Importance'>

Observation

  • DEBTINC is the most important variable, with much higher significance than the other variables. Other important features include CLAGE, LOAN, VALUE, and CLNO.
  • DEBTINC was also the most important feature in the logistic regression model.
  • Let's add class weights to the decision tree model and tune it to see if there is any improvement

Plotting the decision tree¶

In [174]:
features = list(x.columns)

plt.figure(figsize = (30, 20))

tree.plot_tree(dt, max_depth = 4, feature_names = features, filled = True, fontsize = 12, node_ids = True, class_names = True)

plt.show()

Observation¶

Blue leaves represent defaulters, i.e., y[1], and orange leaves represent non-defaulters, i.e., y[0]. Also, the more observations in a leaf, the darker its color.

  • Clients or applicants with DEBTINC less than or equal to 45 but with CLAGE less than 243 have a higher chance of repaying their loans

Decision Tree - Hyperparameter Tuning¶

  • Hyperparameter tuning is tricky in the sense that there is no direct way to calculate how a change in the hyperparameter value will reduce the loss of your model, so we usually resort to experimentation. We'll use Grid search to perform hyperparameter tuning.
  • Grid search is a tuning technique that attempts to compute the optimum values of hyperparameters.
  • It is an exhaustive search that is performed on the specific parameter values of a model.
  • The parameters of the estimator/model used to apply these methods are optimized by cross-validated grid-search over a parameter grid.

Criterion {“gini”, “entropy”}

The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain.

max_depth

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

min_samples_leaf

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

You can learn about more hyperparameters at the link below and try tuning them.

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
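For intuition, the two split criteria can be computed by hand. These helper functions are a sketch (not sklearn internals), evaluated at the dataset's ~80/20 class split:

```python
import numpy as np

def gini(p):
    """Gini impurity for a class-probability vector p."""
    p = np.asarray(p)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy (in bits) for a class-probability vector p."""
    p = np.asarray(p)
    p = p[p > 0]  # skip zero-probability classes to avoid log(0)
    return -np.sum(p * np.log2(p))

# A node holding the dataset's ~80/20 BAD split:
print(round(gini([0.8, 0.2]), 3))     # 0.32
print(round(entropy([0.8, 0.2]), 3))  # 0.722
```

Both measures are 0 for a pure node and maximal at a 50/50 split; a split is chosen to maximize the impurity reduction from parent to children.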

Tuning Models¶

In [175]:
# check if there is imbalance in the target variable
data['BAD'].value_counts(1)
Out[175]:
0    0.800503
1    0.199497
Name: BAD, dtype: float64

There is imbalance in the target variable

  • To balance it, we will add a class weight of 0.20 to class 0 and 0.80 to class 1
In [176]:
# Choose the type of classifier
dtree_estimator = DecisionTreeClassifier(class_weight = {0: 0.20, 1: 0.80}, random_state = 1)

# Grid of parameters to choose from
parameters = {'max_depth': np.arange(2, 8),
              'criterion': ['gini', 'entropy'],
              'min_samples_leaf': [5, 10, 20, 25]
             }

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(recall_score, pos_label = 1)

# Run the grid search
gridCV = GridSearchCV(dtree_estimator, parameters, scoring = scorer, cv = 10)

# Fitting the grid search on the train data
gridCV = gridCV.fit(x_train, y_train)

# Set the classifier to the best combination of parameters
dtree_estimator = gridCV.best_estimator_

# Fit the best estimator to the data
dtree_estimator.fit(x_train, y_train)
Out[176]:
DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8}, criterion='entropy',
                       max_depth=5, min_samples_leaf=5, random_state=1)
In [177]:
# Checking performance on the training dataset
y_train_pred_dt = dtree_estimator.predict(x_train)

metrics_score(y_train, y_train_pred_dt)
              precision    recall  f1-score   support

           0       0.94      0.73      0.82      3340
           1       0.43      0.81      0.56       832

    accuracy                           0.74      4172
   macro avg       0.68      0.77      0.69      4172
weighted avg       0.84      0.74      0.77      4172

Observation:

  • In comparison to the model with default values of hyperparameters, the performance on the training set has gone down significantly. This makes sense because we are trying to reduce overfitting.
  • The model is very precise in predicting class 0. For class 1, however, recall is high while precision is low: the model classifies many instances as class 1 to capture most actual defaulters, but in doing so it misclassifies many class 0 instances.
In [178]:
# Checking performance on the test dataset
y_test_pred_dt = dtree_estimator.predict(x_test)

metrics_score(y_test, y_test_pred_dt)
              precision    recall  f1-score   support

           0       0.93      0.72      0.81      1431
           1       0.42      0.79      0.55       357

    accuracy                           0.74      1788
   macro avg       0.67      0.76      0.68      1788
weighted avg       0.83      0.74      0.76      1788

In [179]:
dtree_tuned_test = model_performance_classification(dtree_estimator,x_test,y_test)
dtree_tuned_test
Out[179]:
Precision Recall Accuracy
0 0.674359 0.757295 0.736018

Observation

  • The model demonstrates high precision for class 0 but could improve in terms of recall.
  • For class 1, recall is high while precision and the F1-score are low; there is a clear trade-off. The model identifies as many class 1 instances as possible, resulting in high recall, but at the expense of precision, leading to a significant number of false positives.
  • The overall accuracy of 74% suggests the model's predictions are mostly correct, but there's room for improvement, especially in optimizing the balance between precision and recall for class 1.

Let's plot the feature importance and check the most important features.¶

In [180]:
importances = dtree_estimator.feature_importances_

columns = x.columns

importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)

plt.figure(figsize = (13, 13))

sns.barplot(x=importance_df.Importance, y=importance_df.index);

Observation

  • DEBTINC is the most important variable, with much higher significance than the other variables. Other important features include CLAGE, MORTDUE, LOAN, and JOB_Self.
  • DEBTINC was also the most important feature in the default decision tree model.

Lets plot the tree¶

In [181]:
features = list(x.columns)

plt.figure(figsize = (30, 20))

tree.plot_tree(dtree_estimator, max_depth = 4, feature_names = features, filled = True, fontsize = 12, node_ids = True, class_names = True)

plt.show()

Observation

  • From the tree, DEBTINC played a major role in determining whether an applicant or client would default
  • Applicants with DEBTINC less than 35.44 and a loan amount less than 15050 are likely to repay their loans
  • Applicants with DEBTINC less than 14.33 and NINQ less than 1.5 have a higher chance of repaying their loans

Building a Random Forest Classifier¶

Random Forest is a bagging algorithm where the base models are Decision Trees. Bootstrap samples are drawn from the training data, and a decision tree is trained on each sample to make predictions.

The results from all the decision trees are combined together and the final prediction is made using voting or averaging.
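The averaging step can be verified directly: in scikit-learn, a classification forest's predicted probabilities are the mean of its trees' predicted probabilities. A sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset (stand-in for the loan data)
X, y = make_classification(n_samples=300, random_state=1)

rf = RandomForestClassifier(n_estimators=25, random_state=1).fit(X, y)

# The forest's probability is the average of its individual trees' probabilities
tree_probs = np.mean([t.predict_proba(X) for t in rf.estimators_], axis=0)
print(np.allclose(rf.predict_proba(X), tree_probs))  # True
```

Because each tree sees a different bootstrap sample (and random feature subsets at each split), the trees make partly independent errors, and averaging them reduces the variance of the final prediction.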

In [182]:
# Fitting the Random Forest classifier on the training data
rf_estimator = RandomForestClassifier( random_state = 1)

rf_estimator.fit(x_train, y_train)
Out[182]:
RandomForestClassifier(random_state=1)
In [183]:
# Checking performance on the training data
y_pred_train_rf = rf_estimator.predict(x_train)

metrics_score(y_train, y_pred_train_rf)
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3340
           1       1.00      1.00      1.00       832

    accuracy                           1.00      4172
   macro avg       1.00      1.00      1.00      4172
weighted avg       1.00      1.00      1.00      4172

Observation:

  • The Random Forest is giving a 100% score for all metrics on the training dataset.
In [184]:
# Checking performance on the testing data
y_pred_test_rf = rf_estimator.predict(x_test)

metrics_score(y_test, y_pred_test_rf)
              precision    recall  f1-score   support

           0       0.90      0.98      0.94      1431
           1       0.87      0.54      0.67       357

    accuracy                           0.89      1788
   macro avg       0.88      0.76      0.80      1788
weighted avg       0.89      0.89      0.88      1788

In [185]:
rf_estimator_test = model_performance_classification(rf_estimator,x_test,y_test)
rf_estimator_test
Out[185]:
Precision Recall Accuracy
0 0.884894 0.761925 0.893177

Observations:

  • The Random Forest classifier seems to be overfitting the training data. The recall and F1-score for class 1 are both 1.00 on the training data, while on the test data the recall drops to ~0.54 and the F1-score to ~0.67.
  • The model performs very well in predicting class 0, with both high precision and recall.
  • For class 1, while precision is high, recall is considerably lower. This indicates that the model is missing a significant portion of actual class 1 instances.
  • The overall accuracy of 89% on the test set suggests a robust model, especially given that the test set contains unseen data.
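The train/test gap above is the classic overfitting signature, and cross-validation on the training data gives a less optimistic estimate than the perfect training score. A minimal sketch on synthetic data (the same check could be run on `x_train`, `y_train` in this notebook):

```python
# Sketch: cross-validation exposes overfitting that the training score hides.
# Synthetic imbalanced data keeps the example self-contained.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, weights=[0.8, 0.2], random_state=1)

clf = RandomForestClassifier(random_state=1)
cv_acc = cross_val_score(clf, X, y, cv=5).mean()  # held-out estimate
train_acc = clf.fit(X, y).score(X, y)             # optimistic training score

print(train_acc >= cv_acc)
```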
In [186]:
importances = rf_estimator.feature_importances_

columns = x.columns

importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)

plt.figure(figsize = (13, 13))

sns.barplot(x=importance_df.Importance, y=importance_df.index);

DEBTINC was also the most important feature with the default random forest model.

Random Forest Classifier Hyperparameter Tuning¶

Tuning the Random Forest classifier

In [187]:
# Choose the type of classifier
rf_estimator_tuned = RandomForestClassifier(class_weight = {0: 0.2, 1: 0.8}, random_state = 1)

# Grid of parameters to choose from
params_rf = {
        "n_estimators": [100, 250, 500],
        "min_samples_leaf": np.arange(1, 4, 1),
        "max_features": [0.7, 0.9, 'auto'],
}


# Type of scoring used to compare parameter combinations - recall score for class 1
scorer = metrics.make_scorer(recall_score, pos_label = 1)

# Run the grid search
grid_obj = GridSearchCV(rf_estimator_tuned, params_rf, scoring = scorer, cv = 5)

grid_obj = grid_obj.fit(x_train, y_train)

# Set the classifier to the best combination of parameters
rf_estimator_tuned = grid_obj.best_estimator_
In [188]:
rf_estimator_tuned.fit(x_train, y_train)
Out[188]:
RandomForestClassifier(class_weight={0: 0.2, 1: 0.8}, max_features='auto',
                       min_samples_leaf=3, random_state=1)
In [189]:
# Checking performance on the training data
y_pred_train_rf_tuned = rf_estimator_tuned.predict(x_train)

metrics_score(y_train, y_pred_train_rf_tuned)
              precision    recall  f1-score   support

           0       1.00      0.99      0.99      3340
           1       0.95      0.99      0.97       832

    accuracy                           0.99      4172
   macro avg       0.98      0.99      0.98      4172
weighted avg       0.99      0.99      0.99      4172

Observation

  • The model shows very strong performance for both classes on the training data, with both precision and recall being very high.
  • Both negative (class 0) and positive (class 1) instances are predicted with a high degree of accuracy.
  • The overall accuracy of 99% here is on the training data, so it should be read alongside the test performance that follows.
In [190]:
# Checking performance on the test data
y_pred_test_rf_tuned = rf_estimator_tuned.predict(x_test)

metrics_score(y_test, y_pred_test_rf_tuned)
              precision    recall  f1-score   support

           0       0.92      0.96      0.94      1431
           1       0.79      0.64      0.71       357

    accuracy                           0.90      1788
   macro avg       0.85      0.80      0.82      1788
weighted avg       0.89      0.90      0.89      1788

In [191]:
rf_estimator_tuned_test = model_performance_classification(rf_estimator_tuned, x_test, y_test)
rf_estimator_tuned_test
Out[191]:
Precision Recall Accuracy
0 0.854162 0.801164 0.895414
In [192]:
importances = rf_estimator_tuned.feature_importances_

columns = x.columns

importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)

plt.figure(figsize = (13, 13))

sns.barplot(x=importance_df.Importance, y=importance_df.index);

DEBTINC was also the most important feature with the tuned Random Forest model.

Observations:

  • The tuned Random Forest classifier seems to be better than the default random forest model. There seems to be an improvement in recall and F1-score for class 1 when compared to the default model.
  • The model performs strongly in predicting class 0, both in terms of precision and recall.
  • For class 1, while the precision is quite good, the recall indicates that there's a significant portion of class 1 instances that the model is not capturing. This suggests that the model might be more conservative in predicting class 1, resulting in more false negatives for this class.
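One way to push class 1 recall further without retraining is to lower the decision threshold on the predicted probabilities, trading precision for recall. A minimal sketch on synthetic imbalanced data; in this notebook the same idea would apply to `rf_estimator_tuned` and `x_test`:

```python
# Sketch: lowering the probability threshold makes the model more aggressive
# about flagging class 1, raising recall at the cost of precision.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)
clf = RandomForestClassifier(random_state=1).fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]
default_pred = (proba >= 0.5).astype(int)  # standard 0.5 threshold
lowered_pred = (proba >= 0.3).astype(int)  # more aggressive about class 1

print(recall_score(y_te, lowered_pred) >= recall_score(y_te, default_pred))
```

The right threshold depends on the bank's relative cost of a missed defaulter versus a wrongly rejected applicant.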

Additional Models¶

Boosting Models¶

Let's now look at the other kind of ensemble technique, known as Boosting.

XGBoost¶

  • XGBoost stands for Extreme Gradient Boosting.
  • XGBoost is a tree-based ensemble machine learning technique that improves prediction power and performance by building on the Gradient Boosting framework and incorporating reliable approximation algorithms. It is widely used and routinely appears at the top of data science competition leaderboards.
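The core boosting idea behind XGBoost can be sketched in a few lines: each new tree fits the residual errors of the ensemble built so far, so the training error shrinks as trees are added. This is a hand-rolled illustration on synthetic regression data, not XGBoost's actual (regularized, second-order) algorithm:

```python
# Sketch of the boosting idea: sequentially fit shallow trees to residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.3
pred = np.zeros_like(y)              # start from a zero prediction
errors = []
for _ in range(50):
    residual = y - pred              # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += learning_rate * tree.predict(X)  # add a small correction
    errors.append(np.mean((y - pred) ** 2))

# Training error should shrink as trees are added.
print(errors[-1] < errors[0])
```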
In [193]:
# Installing the xgboost library using the 'pip' command.
!pip install xgboost
Requirement already satisfied: xgboost in /usr/local/lib/python3.10/dist-packages (2.0.0)
Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from xgboost) (1.23.5)
Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from xgboost) (1.11.3)
In [194]:
# Importing the AdaBoostClassifier and GradientBoostingClassifier [Boosting]
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier

# Importing the XGBClassifier from the xgboost library
from xgboost import XGBClassifier
In [195]:
# Adaboost Classifier
adaboost_model = AdaBoostClassifier(random_state = 1)

# Fitting the model
adaboost_model.fit(x_train, y_train)

# Model Performance on the test data
adaboost_model_perf_test = model_performance_classification(adaboost_model,x_test,y_test)


adaboost_model_perf_test
Out[195]:
Precision Recall Accuracy
0 0.81559 0.676151 0.853468
In [196]:
importances = adaboost_model.feature_importances_

columns = x.columns

importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)

plt.figure(figsize = (13, 13))

sns.barplot(x=importance_df.Importance, y=importance_df.index);

DEBTINC was also the most important feature with the AdaBoost model.

In [197]:
# Gradient Boost Classifier
gbc = GradientBoostingClassifier(random_state = 1)

# Fitting the model
gbc.fit(x_train, y_train)

# Model Performance on the test data
gbc_perf_test = model_performance_classification(gbc, x_test, y_test)

gbc_perf_test
Out[197]:
Precision Recall Accuracy
0 0.853519 0.725517 0.87528
In [198]:
importances = gbc.feature_importances_

columns = x.columns

importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)

plt.figure(figsize = (13, 13))

sns.barplot(x=importance_df.Importance, y=importance_df.index);

DEBTINC was also the most important feature with the Gradient Boosting model.

In [199]:
# XGBoost Classifier
xgb = XGBClassifier(random_state = 1, eval_metric = 'logloss')

# Fitting the model
xgb.fit(x_train,y_train)

# Model Performance on the test data
xgb_perf_test = model_performance_classification(xgb,x_test,y_test)

xgb_perf_test
Out[199]:
Precision Recall Accuracy
0 0.866014 0.792752 0.897092
In [200]:
importances = xgb.feature_importances_

columns = x.columns

importance_df = pd.DataFrame(importances, index = columns, columns = ['Importance']).sort_values(by = 'Importance', ascending = False)

plt.figure(figsize = (13, 13))

sns.barplot(x=importance_df.Importance, y=importance_df.index);

DEBTINC was also the most important feature with the XGBoost model.

Comparison of all the models we have built so far¶

In [201]:
models_test_comp_df = pd.concat(

    [
    lg_test_2.T, lg_test_3.T, dtree_test.T, dtree_tuned_test.T,rf_estimator_test.T,
    rf_estimator_tuned_test.T, adaboost_model_perf_test.T,
    gbc_perf_test.T, xgb_perf_test.T
    ],

    axis = 1,
)

models_test_comp_df.columns = [
    "Logistic Regression",
    "Tuned Logistic regression",
    "Decision Tree classifier",
    "Tuned Decision Tree classifier",
    "Random Forest classifier",
    "Tuned Random Forest classifier",
    "Adaboost classifier",
    "Gradientboost classifier",
    "XGBoost classifier"
]
In [202]:
print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
Out[202]:
Logistic Regression Tuned Logistic regression Decision Tree classifier Tuned Decision Tree classifier Random Forest classifier Tuned Random Forest classifier Adaboost classifier Gradientboost classifier XGBoost classifier
Precision 0.584698 0.744080 0.761261 0.674359 0.884894 0.854162 0.815590 0.853519 0.866014
Recall 0.632093 0.518563 0.758502 0.757295 0.761925 0.801164 0.676151 0.725517 0.792752
Accuracy 0.601230 0.804810 0.847315 0.736018 0.893177 0.895414 0.853468 0.875280 0.897092

Observations:

  • The Tuned Random Forest gives the best performance for this dataset.
  • Among the boosting algorithms, XGBoost has given the best scores.
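With the comparison table above, the best model per metric can also be read off programmatically with `idxmax`. A small stand-in DataFrame (with rounded values from the table) is used here so the snippet runs on its own; in the notebook one would call this on `models_test_comp_df` directly:

```python
# Sketch: pick the best model for a given metric from the comparison table.
import pandas as pd

comp = pd.DataFrame(
    {"Tuned Random Forest classifier": [0.854, 0.801, 0.895],
     "XGBoost classifier": [0.866, 0.793, 0.897]},
    index=["Precision", "Recall", "Accuracy"],
)

# Recall is the metric of interest here (catching defaulters).
best_by_recall = comp.loc["Recall"].idxmax()
print(best_by_recall)
```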

1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):

  • How do different techniques perform? Which one is performing relatively better? Is there scope to improve the performance further?

2. Refined insights:

  • What are the most meaningful insights relevant to the problem?
  • Across all the models, we found DEBTINC to be an important feature in determining whether an applicant or client will default on loan payments.
  • Other important features detected by the models are CLAGE, LOAN, MORTDUE, VALUE, NINQ, and JOB_Self.

3. Proposal for the final solution design:

  • What model do you propose to be adopted? Why is this the best solution to adopt?
  • Of all the models evaluated, the tuned Random Forest model is the best one to use to predict whether an applicant will default. It shows an improvement in recall and F1-score for class 1 compared to the other models.
  • The overall accuracy of ~90% indicates robust model performance on unseen data.

Conclusion¶

  • The best model we have so far is the tuned Random Forest model, which achieves a recall of ~80%. This indicates that ensemble methods can capture more complex patterns in the data than simpler models such as logistic regression.
  • The bank should use this model to identify in advance which applicants are likely to default on their loans and act accordingly.
  • DEBTINC, CLAGE, LOAN, MORTDUE, and VALUE are the most important features and should therefore be prioritized when approving loans.
  • Given the importance of correctly identifying bad loans, focusing on recall is crucial. While the tuned Random Forest model showed promise, there is still room for improvement.

Recommendations¶

  • We saw that DEBTINC (debt-to-income ratio) is the most important feature in all the models. Hence, the bank should make DEBTINC one of the priority features to check before approving loans.
  • More resources should be allocated to determining the DEBTINC of each client before approving loans.
  • DEROG and DELINQ had little impact on loan defaults. Hence, the resources allocated to checking these features can be reduced.